Abstract

In this paper we study the problem of recovering a low-rank matrix from linear measurements. Our 
algorithm, which we call Procrustes Flow, starts from an initial estimate obtained by a thresholding 
scheme followed by gradient descent on a non-convex objective. We show that as long as the measurements 
obey a standard restricted isometry property, our algorithm converges to the unknown matrix at a 
geometric rate. In the case of Gaussian measurements, such convergence occurs for an $n_1 \times n_2$ matrix of
rank $r$ when the number of measurements exceeds a constant times $(n_1 + n_2)r$.


1 Introduction 

Low rank models are ubiquitous in machine learning, and over a decade of research has been dedicated to 
determining when such models can be efficiently recovered from partial information [Faz02, RS05, CR09]. 
See [DR16] for an extended survey on this topic. The simplest such recovery problem asks: how can we
find a low-rank matrix obeying a set of linear equations? What is the computational complexity of such
an algorithm? More specifically, we are interested in solving problems of the form

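$$\min_{M \in \mathbb{R}^{n_1 \times n_2}} \operatorname{rank}(M) \quad \text{s.t.} \quad \mathcal{A}(M) = b, \qquad (1.1)$$

where $\mathcal{A} : \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^m$ is a known affine transformation that maps matrices to vectors. More specifically,
the $k$-th entry of $\mathcal{A}(X)$ is $\langle A_k, X \rangle := \operatorname{Tr}(A_k^T X)$, where each $A_k \in \mathbb{R}^{n_1 \times n_2}$.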

Since the early seventies, a popular heuristic for solving such problems has been to replace $M$ with a
low-rank factorization $M = UV^T$ and solve matrix bilinear equations of the form

$$\text{find } U \in \mathbb{R}^{n_1 \times r},\ V \in \mathbb{R}^{n_2 \times r} \quad \text{s.t.} \quad \mathcal{A}(UV^T) = b, \qquad (1.2)$$

via a local search heuristic [Ruh74]. Many researchers have demonstrated that such heuristics work well in 
practice for a variety of problems [RS05, Fun06, LRS+10, RR13]. However, these procedures lack the strong
guarantees associated with convex programming heuristics for solving (1.1).

In this paper we show that a local search heuristic solves (1.2) under standard restricted isometry assumptions
on the linear map $\mathcal{A}$. For standard ensembles of equality constraints, we demonstrate that $M$ can
be estimated by such heuristics as long as we have $\Omega((n_1 + n_2)r)$ equations. This is merely a constant factor
more than the number of parameters needed to specify an $n_1 \times n_2$ rank-$r$ matrix. Specialized to a random
Gaussian model and positive semidefinite matrices, our work improves upon recent independent work by 
Zheng and Lafferty [ZL15].

2 Algorithms

In this paper we study a local search heuristic for solving matrix bilinear equations of the form (1.2) which
consists of two components: (1) a careful initialization obtained by a projected gradient scheme on $n_1 \times n_2$
matrices, and (2) a series of successive refinements of this initial solution via a gradient descent scheme.
This algorithm is a natural extension of the Wirtinger Flow algorithm developed in [CLS15] for solving 
vector quadratic equations. Following [CLS15], we shall refer to the combination of these two steps as the 
Procrustes Flow (PF) algorithm. We shall describe two variants of our algorithm based on whether the 
sought after solution M is positive semidefinite or not. The former is detailed in Algorithm 1, and the latter 
in Algorithm 2. 

The initialization phase of both variants is rather similar and is described in Section 2.1. The successive 
refinement phase is explained in Section 2.2 for positive semidefinite (PSD) matrices and in Section 2.3 for 
arbitrary matrices. Throughout this paper when describing the PSD case, we assume that the matrix
$M$ is $n \times n$, i.e. $n_1 = n_2 = n$.


2.1 Initialization via low-rank projected gradients 

In the initial phase of our algorithm we start from $M_0 = 0_{n_1 \times n_2}$ and apply successive updates of the form

$$M_{\tau+1} = \mathcal{P}_r\Big( M_\tau - \alpha_{\tau+1} \sum_{k=1}^m \big(\langle A_k, M_\tau \rangle - b_k\big) A_k \Big) \qquad (2.1)$$

on rank-$r$ matrices of size $n_1 \times n_2$. Here, $\mathcal{P}_r$ denotes projection onto either rank-$r$ matrices or rank-$r$ PSD
matrices, both of which can be computed efficiently via Lanczos methods. We run (2.1) for $T_0$ iterations
and use the resulting matrix $M_{T_0}$ for initialization purposes. In the PSD case, we set our initialization to
an $n \times r$ matrix $U_0$ obeying $M_{T_0} = U_0 U_0^T$. In the more general case of rectangular matrices we need to use
two factors. Let $M_{T_0} = C_{T_0} \Sigma_{T_0} D_{T_0}^T$ be the Singular Value Decomposition (SVD) of $M_{T_0}$. We initialize our
algorithm in the rectangular case by setting $U_0 = C_{T_0} \Sigma_{T_0}^{1/2}$ and $V_0 = D_{T_0} \Sigma_{T_0}^{1/2}$.

Updates of the form (2.1) have a long history in the compressed sensing/matrix sensing literature (see e.g.
[TG07, GK09, NT09, NV09, BD09, MJD09, CCS10]). Furthermore, using the first step of the update (2.1)
for the purposes of initialization has also been proposed in previous work (see e.g. [AM07, KMO10, JNS13]).
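
The initialization (2.1) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes the measurement matrices are stacked in a hypothetical array `As` of shape `(m, n1, n2)`, it uses a dense SVD or eigendecomposition in place of the Lanczos methods mentioned above, and the names `project_rank_r` and `initialize` are ours. The step size `alpha` is left as a parameter because its appropriate value depends on how the $A_k$ are normalized.

```python
import numpy as np

def project_rank_r(M, r, psd=False):
    """Frobenius-norm projection of M onto rank-r matrices (truncated SVD).
    With psd=True, M is symmetrized and projected onto rank-r PSD matrices."""
    if psd:
        S = (M + M.T) / 2.0
        w, Q = np.linalg.eigh(S)                  # eigenvalues in ascending order
        w, Q = w[::-1][:r], Q[:, ::-1][:, :r]     # keep the r largest
        w = np.clip(w, 0.0, None)                 # discard negative eigenvalues
        return (Q * w) @ Q.T
    C, s, Dt = np.linalg.svd(M, full_matrices=False)
    return (C[:, :r] * s[:r]) @ Dt[:r, :]

def initialize(As, b, r, T0, alpha, psd=False):
    """Run T0 projected-gradient steps of (2.1), starting from the zero matrix."""
    M = np.zeros(As.shape[1:])
    for _ in range(T0):
        residual = np.einsum('kij,ij->k', As, M) - b          # <A_k, M> - b_k
        step = np.einsum('k,kij->ij', residual, As)           # sum_k residual_k * A_k
        M = project_rank_r(M - alpha * step, r, psd)
    return M
```

When the $A_k$ form an orthonormal basis of the matrix space, the map is an exact isometry and a single step with `alpha = 1` already returns the best rank-$r$ approximation of the planted matrix.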


2.2 Successive refinement via gradient descent — positive semidefinite case 

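We first focus on the PSD case. As mentioned earlier, we are interested in finding a matrix $U \in \mathbb{R}^{n \times r}$
obeying matrix quadratic equations of the form $\mathcal{A}(UU^T) = b$. We wish to refine our initial estimate by
solving the non-convex optimization problem

$$\min_{U \in \mathbb{R}^{n \times r}} f(U) := \frac{1}{4}\|\mathcal{A}(UU^T) - b\|_{\ell_2}^2 = \frac{1}{4}\sum_{k=1}^m \big(\langle A_k, UU^T \rangle - b_k\big)^2, \qquad (2.2)$$

which minimizes the misfit in our quadratic equations via the square loss. To solve (2.2), starting from our
initial estimate $U_0 \in \mathbb{R}^{n \times r}$, we apply the successive updates

$$U_{\tau+1} := U_\tau - \frac{\mu_{\tau+1}}{\|U_0\|^2}\nabla f(U_\tau) = U_\tau - \frac{\mu_{\tau+1}}{\|U_0\|^2}\Big(\sum_{k=1}^m \big(\langle A_k, U_\tau U_\tau^T \rangle - b_k\big) A_k U_\tau\Big). \qquad (2.3)$$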

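Here and throughout, for a matrix $X$, $\sigma_\ell(X)$ denotes the $\ell$-th largest singular value of $X$, and $\|X\| = \sigma_1(X)$
is the operator norm. We note that the update (2.3) is essentially gradient descent with a carefully chosen
step size.

Algorithm 1 Procrustes Flow (PF)

Require: $\{A_k\}_{k=1}^m$, $\{b_k\}_{k=1}^m$, $\{\alpha_\tau\}_{\tau=1}^\infty$, $\{\mu_\tau\}_{\tau=1}^\infty$, $T_0 \in \mathbb{N}$.
// Initialization phase.
$M_0 := 0_{n \times n}$.
for $\tau = 0, 1, \ldots, T_0 - 1$ do
    // Projection onto rank r PSD matrices.
    $M_{\tau+1} \leftarrow \mathcal{P}_r\big(M_\tau - \alpha_{\tau+1} \sum_{k=1}^m (\langle A_k, M_\tau \rangle - b_k) A_k\big)$.
end for
// SVD of $M_{T_0}$, with $Q \in \mathbb{R}^{n \times r}$, $\Sigma \in \mathbb{R}^{r \times r}$.
$Q \Sigma Q^T := M_{T_0}$.
$U_0 := Q \Sigma^{1/2}$.
// Gradient descent phase.
repeat
    $U_{\tau+1} \leftarrow U_\tau - \frac{\mu_{\tau+1}}{\|U_0\|^2}\big(\sum_{k=1}^m (\langle A_k, U_\tau U_\tau^T \rangle - b_k) A_k U_\tau\big)$.
until convergence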
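
The gradient descent phase of the PSD variant reduces to evaluating the gradient of (2.2). The following NumPy sketch shows one way to do this; the stacked array `As` of shape `(m, n, n)` and the function names are assumptions made for illustration. The gradient is symmetrized so that it is also correct when the $A_k$ are not symmetric; for symmetric $A_k$ it coincides with the expression in (2.3).

```python
import numpy as np

def A_op(As, M):
    """Apply the measurement map: the k-th entry is <A_k, M> = Tr(A_k^T M)."""
    return np.einsum('kij,ij->k', As, M)

def f(As, b, U):
    """Objective (2.2): one quarter of the squared misfit."""
    return 0.25 * np.sum((A_op(As, U @ U.T) - b) ** 2)

def grad_f(As, b, U):
    """Gradient of (2.2); equals sum_k (<A_k, UU^T> - b_k) A_k U for symmetric A_k."""
    S = np.einsum('k,kij->ij', A_op(As, U @ U.T) - b, As)
    return 0.5 * (S + S.T) @ U

def pf_step(As, b, U, mu, U0_norm_sq):
    """One update of the form (2.3) with step size mu / ||U_0||^2."""
    return U - (mu / U0_norm_sq) * grad_f(As, b, U)
```

Here `U0_norm_sq` would be the squared operator norm of the initial factor, for example `np.linalg.norm(U0, 2) ** 2`.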


2.3 Successive refinement via gradient descent — general case 

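We now consider the general case. Here, we are interested in finding matrices $U \in \mathbb{R}^{n_1 \times r}$ and $V \in \mathbb{R}^{n_2 \times r}$
obeying matrix quadratic equations of the form $b = \mathcal{A}(UV^T)$. In this case, we refine our initial estimate
by solving the non-convex optimization problem

$$\min_{U \in \mathbb{R}^{n_1 \times r},\ V \in \mathbb{R}^{n_2 \times r}} g(U, V) := \frac{1}{2}\|\mathcal{A}(UV^T) - b\|_{\ell_2}^2 + \frac{1}{16}\|U^T U - V^T V\|_F^2. \qquad (2.4)$$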


Note that this is similar to (2.2) but adds a regularizer to measure mismatch between $U$ and $V$. Given a
factorization $M = UV^T$, for any invertible $r \times r$ matrix $P$, the pair $UP$ and $VP^{-T}$ is also a valid factorization.
The purpose of the second term in (2.4) is to account for this redundancy and put the two factors on “equal
footing”. To solve (2.4), starting from our initial estimates $U_0$ and $V_0$ we apply the successive updates

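$$U_{\tau+1} := U_\tau - \frac{\mu_{\tau+1}}{\|U_0\|^2}\nabla_U g(U_\tau, V_\tau) = U_\tau - \frac{\mu_{\tau+1}}{\|U_0\|^2}\Big(\sum_{k=1}^m \big(\langle A_k, U_\tau V_\tau^T \rangle - b_k\big) A_k V_\tau + \frac{1}{4} U_\tau \big(U_\tau^T U_\tau - V_\tau^T V_\tau\big)\Big), \qquad (2.5)$$

$$V_{\tau+1} := V_\tau - \frac{\mu_{\tau+1}}{\|V_0\|^2}\nabla_V g(U_\tau, V_\tau) = V_\tau - \frac{\mu_{\tau+1}}{\|V_0\|^2}\Big(\sum_{k=1}^m \big(\langle A_k, U_\tau V_\tau^T \rangle - b_k\big) A_k^T U_\tau + \frac{1}{4} V_\tau \big(V_\tau^T V_\tau - U_\tau^T U_\tau\big)\Big). \qquad (2.6)$$

Again, (2.5) and (2.6) are essentially gradient descent with a carefully chosen step size.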
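
The two gradients of the regularized objective share the weighted sum of measurement matrices, which suggests computing it once per iteration. The sketch below is one possible NumPy rendering of a single pass of the rectangular updates; the array layout `As` of shape `(m, n1, n2)` and the names `grad_g` and `rpf_step` are assumptions for illustration, not part of the original algorithm statement.

```python
import numpy as np

def grad_g(As, b, U, V):
    """Gradients of the regularized objective g with respect to U and V."""
    residual = np.einsum('kij,ij->k', As, U @ V.T) - b      # <A_k, U V^T> - b_k
    S = np.einsum('k,kij->ij', residual, As)                 # sum_k residual_k * A_k
    balance = U.T @ U - V.T @ V                               # mismatch penalized by the regularizer
    grad_U = S @ V + 0.25 * U @ balance
    grad_V = S.T @ U - 0.25 * V @ balance
    return grad_U, grad_V

def rpf_step(As, b, U, V, mu, U0_norm_sq, V0_norm_sq):
    """One simultaneous update of both factors, each evaluated at the current (U, V)."""
    gU, gV = grad_g(As, b, U, V)
    return U - (mu / U0_norm_sq) * gU, V - (mu / V0_norm_sq) * gV
```

A balanced factorization built from the SVD of the planted matrix, with both factors scaled by the square roots of the singular values, is a stationary point of both updates because the misfit and the balance term vanish there.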


3 Main Results 

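For our theoretical results we shall focus on affine maps $\mathcal{A}$ which obey the matrix Restricted Isometry
Property (RIP).

Algorithm 2 Rectangular Procrustes Flow (RPF)

Require: $\{A_k\}_{k=1}^m$, $\{b_k\}_{k=1}^m$, $\{\alpha_\tau\}_{\tau=1}^\infty$, $\{\mu_\tau\}_{\tau=1}^\infty$, $T_0 \in \mathbb{N}$.
// Initialization phase.
$M_0 := 0_{n_1 \times n_2}$.
for $\tau = 0, 1, \ldots, T_0 - 1$ do
    // Projection onto rank r matrices.
    $M_{\tau+1} \leftarrow \mathcal{P}_r\big(M_\tau - \alpha_{\tau+1} \sum_{k=1}^m (\langle A_k, M_\tau \rangle - b_k) A_k\big)$.
end for
// SVD of $M_{T_0}$, with $C \in \mathbb{R}^{n_1 \times r}$, $\Sigma \in \mathbb{R}^{r \times r}$, $D \in \mathbb{R}^{n_2 \times r}$.
$C \Sigma D^T := M_{T_0}$.
$U_0 := C \Sigma^{1/2}$.
$V_0 := D \Sigma^{1/2}$.
// Gradient descent phase.
repeat
    $U_{\tau+1} \leftarrow U_\tau - \frac{\mu_{\tau+1}}{\|U_0\|^2}\big(\sum_{k=1}^m (\langle A_k, U_\tau V_\tau^T \rangle - b_k) A_k V_\tau + \frac{1}{4} U_\tau (U_\tau^T U_\tau - V_\tau^T V_\tau)\big)$.
    $V_{\tau+1} \leftarrow V_\tau - \frac{\mu_{\tau+1}}{\|V_0\|^2}\big(\sum_{k=1}^m (\langle A_k, U_\tau V_\tau^T \rangle - b_k) A_k^T U_\tau + \frac{1}{4} V_\tau (V_\tau^T V_\tau - U_\tau^T U_\tau)\big)$.
until convergence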


Definition 3.1 (Restricted Isometry Property (RIP) [CT05, RFP10]). The map $\mathcal{A}$ satisfies $r$-RIP with
constant $\delta_r$ if

$$(1 - \delta_r)\|X\|_F^2 \le \|\mathcal{A}(X)\|_{\ell_2}^2 \le (1 + \delta_r)\|X\|_F^2$$

holds for all matrices $X \in \mathbb{R}^{n_1 \times n_2}$ of rank at most $r$.
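
To get a feel for the definition, one can sample a Gaussian ensemble and look at the ratio $\|\mathcal{A}(X)\|_{\ell_2}^2 / \|X\|_F^2$ on random low-rank inputs. The NumPy sketch below does this under assumed sizes and a fixed seed chosen purely for illustration. It is only a sanity check: sampling random matrices can show that the ratio is typically close to 1, but it cannot certify an RIP constant, which is a supremum over all rank-$r$ matrices.

```python
import numpy as np

def rip_ratio(As, X):
    """Return ||A(X)||_2^2 / ||X||_F^2, which r-RIP confines to [1 - delta_r, 1 + delta_r]."""
    AX = np.einsum('kij,ij->k', As, X)
    return np.sum(AX ** 2) / np.sum(X ** 2)

rng = np.random.default_rng(0)
n1, n2, r, m = 8, 6, 2, 2000
As = rng.standard_normal((m, n1, n2)) / np.sqrt(m)      # i.i.d. N(0, 1/m) entries

# Evaluate the ratio on 50 random rank-r matrices.
ratios = [
    rip_ratio(As, rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2)))
    for _ in range(50)
]
print(min(ratios), max(ratios))
```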

As mentioned earlier it is not possible to recover the factors $U$ and $V$ in (1.2) exactly. For example, in
the PSD case it is only possible to recover $U$ up to a certain rotational factor: if $U$ obeys (3.2), then so
does any matrix $UR$ with $R \in \mathbb{R}^{r \times r}$ an orthonormal matrix satisfying $R^T R = I_r$. This naturally leads to
defining the distance between two matrices $U, X \in \mathbb{R}^{n \times r}$ as

$$\operatorname{dist}(U, X) := \min_{R \in \mathbb{R}^{r \times r}:\ R^T R = I_r} \|U - XR\|_F. \qquad (3.1)$$

We note that this distance is the solution to the classic orthogonal Procrustes problem (hence the name of the
algorithm). It is known that the optimal rotation matrix $R$ minimizing $\|U - XR\|_F$ is equal to $R = AB^T$,
where $A \Sigma B^T$ is the singular value decomposition (SVD) of $X^T U$. We now have all of the elements in place
to state our main results.
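
The closed-form Procrustes rotation makes the distance cheap to evaluate: one $r \times r$ SVD. A minimal NumPy sketch follows; the function names are ours.

```python
import numpy as np

def procrustes_rotation(U, X):
    """Orthogonal R minimizing ||U - X R||_F: R = A B^T, where A S B^T is the SVD of X^T U."""
    A, _, Bt = np.linalg.svd(X.T @ U)
    return A @ Bt

def dist(U, X):
    """Procrustes distance between two n x r factors."""
    return np.linalg.norm(U - X @ procrustes_rotation(U, X), 'fro')
```

As expected from the rotational ambiguity, `dist(X @ Q, X)` is zero up to rounding for any orthogonal `Q`, even though `X @ Q` and `X` can be far apart in Frobenius norm.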

3.1 Quadratic measurements 

When the low-rank matrix $M \in \mathbb{R}^{n \times n}$ is PSD we are interested in finding a matrix $U \in \mathbb{R}^{n \times r}$ obeying
quadratic equations of the form

$$\mathcal{A}(UU^T) = b, \qquad (3.2)$$

where we assume $b = \mathcal{A}(M)$ for a planted rank-$r$ solution $M = XX^T \in \mathbb{R}^{n \times n}$ with $X \in \mathbb{R}^{n \times r}$. We wish to
recover $X$. This is of course only possible up to a certain rotational factor: if $U$ obeys (3.2), then so does
any matrix $UR$ with $R \in \mathbb{R}^{r \times r}$ an orthonormal matrix satisfying $R^T R = I_r$. Our first theorem shows that
Procrustes Flow indeed recovers $X$ up to this ambiguity factor.

Theorem 3.2. Let $M \in \mathbb{R}^{n \times n}$ be an arbitrary rank-$r$ symmetric positive semidefinite matrix with singular
values $\sigma_1(M) \ge \sigma_2(M) \ge \cdots \ge \sigma_r(M) > 0$ and condition number $\kappa = \sigma_1(M)/\sigma_r(M)$. Assume $M = XX^T$
for some $X \in \mathbb{R}^{n \times r}$ and let $b = \mathcal{A}(M) \in \mathbb{R}^m$ be $m$ linear measurements. Furthermore, assume the mapping
$\mathcal{A}$ obeys rank-$6r$ RIP with RIP constant $\delta_{6r} \le 1/10$. Also let $\alpha_\tau = 1/m$ for all $\tau = 1, 2, \ldots$. Then, using
$T_0 \ge \log(\sqrt{r}\kappa) + 2$ iterations of the initialization phase of Procrustes Flow as stated in Algorithm 1 yields a
solution $U_0$ obeying

$$\operatorname{dist}(U_0, X) \le \frac{1}{4}\sigma_r(X). \qquad (3.3)$$

Furthermore, take a constant step size $\mu_\tau = \mu$ for all $\tau = 1, 2, \ldots$, with $\mu \le 36/425$. Then, starting from
any initial solution obeying (3.3), the $\tau$-th iterate of Algorithm 1 satisfies

$$\operatorname{dist}(U_\tau, X) \le \frac{1}{4}\Big(1 - \frac{8}{25}\frac{\mu}{\kappa}\Big)^{\tau/2}\sigma_r(X). \qquad (3.4)$$

3.2 Bilinear measurements 

In the more general case when the low-rank matrix $M \in \mathbb{R}^{n_1 \times n_2}$ is rectangular we are interested in finding
matrices $U \in \mathbb{R}^{n_1 \times r}$, $V \in \mathbb{R}^{n_2 \times r}$ obeying bilinear equations of the form

$$\mathcal{A}(UV^T) = b, \qquad (3.5)$$

where we assume $b = \mathcal{A}(M)$ for a planted rank-$r$ solution $M \in \mathbb{R}^{n_1 \times n_2}$ with $M = XY^T$ where $X \in \mathbb{R}^{n_1 \times r}$
and $Y \in \mathbb{R}^{n_2 \times r}$. Again we wish to recover the factors $X$ and $Y$. The next theorem shows that we can also
provide a guarantee similar to that of Theorem 3.2 for this more general rectangular case.

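Theorem 3.3. Let $M \in \mathbb{R}^{n_1 \times n_2}$ be an arbitrary rank-$r$ matrix with singular values $\sigma_1(M) \ge \sigma_2(M) \ge \cdots \ge
\sigma_r(M) > 0$ and condition number $\kappa = \sigma_1(M)/\sigma_r(M)$. Let $M = A \Sigma B^T$ be the SVD of $M$ and define $X =
A \Sigma^{1/2} \in \mathbb{R}^{n_1 \times r}$ and $Y = B \Sigma^{1/2} \in \mathbb{R}^{n_2 \times r}$. Also, let $b = \mathcal{A}(M) \in \mathbb{R}^m$ be $m$ linear measurements where the
mapping $\mathcal{A}$ obeys rank-$6r$ RIP with RIP constant $\delta_{6r} \le 1/25$. Also let $\alpha_\tau = 1/m$ for all $\tau = 1, 2, \ldots$. Then,
using $T_0 \ge 3\log(\sqrt{r}\kappa) + 5$ iterations of the initialization phase of Procrustes Flow as stated in Algorithm 2
yields a solution $U_0, V_0$ obeying

$$\operatorname{dist}\left(\begin{bmatrix} U_0 \\ V_0 \end{bmatrix}, \begin{bmatrix} X \\ Y \end{bmatrix}\right) \le \frac{1}{4}\sigma_r(X). \qquad (3.6)$$

Furthermore, take a constant step size $\mu_\tau = \mu$ for all $\tau = 1, 2, \ldots$ and assume $\mu \le 2/187$. Then, starting
from any initial solution obeying (3.6), the $\tau$-th iterate of Algorithm 2 satisfies

$$\operatorname{dist}\left(\begin{bmatrix} U_\tau \\ V_\tau \end{bmatrix}, \begin{bmatrix} X \\ Y \end{bmatrix}\right) \le \frac{1}{4}\Big(1 - \frac{4}{25}\frac{\mu}{\kappa}\Big)^{\tau/2}\sigma_r(X). \qquad (3.7)$$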


The above theorem shows that the Procrustes Flow algorithm achieves a good initialization under the RIP
assumptions on the mapping $\mathcal{A}$. Also, starting from any sufficiently accurate initialization the algorithm
exhibits geometric convergence to the unknown matrix $M$. We note that in the above result we have not
attempted to optimize the constants. Furthermore, there is a natural tradeoff involved between the upper
bound on the RIP constant, the radius in which PF is contractive (3.6), and its rate of convergence (3.7).
In particular, as will become clear in the proofs, one can increase the radius in which PF is contractive
(increase the constant 1/4 in (3.6)) and the rate of convergence (increase the constant 4/25 in (3.7)) by
assuming a smaller upper bound on the RIP constant.

The most common measurement ensemble which satisfies the isotropy and RIP assumptions is the Gaussian
ensemble where each matrix $A_k$ has i.i.d. $\mathcal{N}(0, 1/m)$ entries.^2 For this ensemble to achieve a RIP constant
of $\delta_r$, we require at least $m = \Omega(\frac{1}{\delta_r^2} nr)$ measurements. Using equation (3.7) together with a simple calculation
detailed in Appendix D, we can conclude that for $M_\tau = U_\tau V_\tau^T$, we have

$$\|M_\tau - M\|_F \le [\ldots] \cdot \operatorname{dist}\left(\begin{bmatrix} U_\tau \\ V_\tau \end{bmatrix}, \begin{bmatrix} X \\ Y \end{bmatrix}\right).$$

Thus, applying Theorem 3.3 to this measurement ensemble, we conclude that the Procrustes Flow algorithm
yields a solution with relative error $\|M_\tau - M\|_F / \|M\|_F \le \epsilon$ in $O(\kappa \log(1/\epsilon))$ iterations using only $\Omega(nr)$
measurements. We would like to note that if more measurements are available it is not necessary to use
multiple projected gradient updates in the initialization phase. In particular, for the Gaussian model if
$m = \Omega(nr^2\kappa^2)$, then (3.3) will hold after the first iteration ($T_0 = 1$).

^2 We note that in the PSD case the so called spiked Gaussian ensemble would be the right equivalent. In this case each
symmetric matrix $A_k$ has $\mathcal{N}(0, 1/m)$ entries on the diagonal and $\mathcal{N}(0, 1/2m)$ entries elsewhere.



How to verify the initialization is complete. Theorems 3.2 and 3.3 require that $T_0 = \Omega(\log(\sqrt{r}\kappa))$,
but $\kappa$ is a property of $M$ and is hence unknown. However, under the same hypotheses regarding the RIP
constant in Theorems 3.2 and 3.3, we can use each iterate of initialization to test whether or not we have
entered the radius of convergence. The following lemma establishes a sufficient condition we can check using
only information from $M_\tau$. We establish this result only in the symmetric case; the extension to the general
case is straightforward. The proof is deferred to Appendix B.

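Lemma 3.4. Assume the RIP constant of $\mathcal{A}$ satisfies $\delta_{2r} \le 1/10$. Let $M_\tau$ denote the $\tau$-th step of the
initialization phase in Algorithm 1, and let $U_0 \in \mathbb{R}^{n \times r}$ be such that $M_\tau = U_0 U_0^T$. Define

$$e_\tau := \|\mathcal{A}(M_\tau) - b\|_{\ell_2} = \|\mathcal{A}(M_\tau - XX^T)\|_{\ell_2}.$$

Then, if

$$e_\tau \le \frac{3}{20}\sigma_r(M_\tau),$$

we have that

$$\operatorname{dist}(U_0, X) \le \frac{1}{4}\sigma_r(X).$$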


One might consider using solely the projected gradient updates (i.e. set $T_0 = \infty$) as in previous approaches
[TG07, GK09, NT09, NV09, BD09, MJD09, CCS10]. We note that the projected gradient updates in the
initialization phase require computing the first $r$ singular vectors of a matrix whereas the gradient updates do
not require any singular vector computations. Such singular vector computations may be prohibitive compared to
the gradient updates, especially when $n_1$ or $n_2$ is large and for ensembles where matrix-vector multiplication
is fast. We would like to emphasize, however, that for small $n_1, n_2$ and dense matrices using projected
gradient updates may be more efficient. Our scheme is a natural interpolation: one could only do projected
gradient steps, or one could do one projected gradient step. Here we argue that very few projected gradient steps
provide sufficient initialization such that gradient descent converges geometrically.


4 Related work 

There is a vast literature dedicated to low-rank matrix recovery/sensing and semidefinite programming. We
shall only focus on the papers most related to our framework.

Recht, Fazel, and Parrilo were the first to study low-rank solutions of linear matrix equations under
RIP assumptions [RFP10]. They showed that if the rank-$r$ RIP constant of $\mathcal{A}$ is less than a fixed numerical
constant, then the matrix with minimum trace satisfying the equality constraints coincides with the minimum
rank solution. In particular, for the Gaussian ensemble the required number of measurements is $\Omega(nr)$ [CP11].
Subsequently, a series of papers [CR09, Gro11, Rec11, CLS14] showed that trace minimization and related
convex optimization approaches also work for other measurement ensembles such as those arising in matrix
completion and related problems. In this paper we have established a similar result to [RFP10]. We require
the same order of measurements $\Omega(nr)$ but use a more computationally friendly local search algorithm. Also
related to this work are projected gradient schemes with hard thresholding [TG07, GK09, NT09, NV09,
BD09, MJD09, CCS10]. Such algorithms enjoy similar guarantees to that of [RFP10] and this work. Indeed,
we utilize such results in the initialization phase of our algorithm. However, such algorithms require a rank-$r$
SVD in each iteration which may be expensive for large problem sizes. We would like to emphasize, however,
that for small problem sizes and dense matrices (such as Gaussian ensembles) such algorithms may be faster
than gradient descent approaches such as ours.

More recently, there have been a few results using non-convex optimization schemes for matrix recovery
problems. In particular, theoretical guarantees for matrix completion have been established using manifold
optimization [KMO10] and alternating minimization [Kes12] (albeit with the caveat of requiring a fresh set
of samples in each iteration). See also [Har14, SL15]. Later on, Jain et al. [JNS13] analyzed the performance
of alternating minimization under similar modeling assumptions to [RFP10] and this paper. However, the
requirements on the RIP constant in [JNS13] are more stringent compared to [RFP10] and ours. In particular,
the authors require $\delta_{4r} \le c/r$ whereas we only require $\delta_{6r} \le c$. Specialized to the Gaussian model, the results
of [JNS13] require $\Omega(n r^{[\ldots]} \kappa^{[\ldots]})$ measurements.^3

Our algorithm and analysis are inspired by the recent paper [CLS15] by Candès, Li and Soltanolkotabi.
See also [Sol14, CLM15] for some stability results. In [CLS15] the authors introduced a local regularity
condition to analyze the convergence of a gradient descent-like scheme for phase retrieval. We use a similar
regularity condition but generalize it to ranks higher than one. Recently, independent of our work, Zheng and
Lafferty [ZL15] provided an analysis of gradient descent using (2.2) via the same regularity condition. Zheng
and Lafferty focus on the Gaussian ensemble, and establish a sample complexity of $m = \Omega(nr^3\kappa^2 \log n)$.
In comparison we only require $\Omega(nr)$ measurements, both removing the dependence on $\kappa$ in the sample
complexity and improving the asymptotic rate. We would like to emphasize that the improvement in our
result is not just due to the more sophisticated initialization scheme. In particular, Zheng and Lafferty show
geometric convergence starting from any initial solution obeying $\operatorname{dist}(U_0, X) \le c \cdot \sigma_r(X)$ as long as the
number of measurements obeys $m = \Omega(nr\kappa^2 \log n)$. In contrast, we establish geometric convergence starting
from the same neighborhood of $U_0$ with only $\Omega(nr)$ measurements. Our results also differ in terms of the
convergence rate. We establish a convergence rate of the form $1 - \mu/\kappa$ whereas [ZL15] establishes a slower
convergence rate. Moreover, the theory of restricted isometries in our work considerably
simplifies the analysis.

Finally, we would also like to mention [SOR15] for guarantees using stochastic gradient algorithms. The
results of [SOR15] are applicable to a variety of models; focusing on the Gaussian ensemble, the authors require
$\Omega((nr \log n)/\epsilon)$ samples to reach a relative error of $\epsilon$. In contrast, our sample complexity is independent
of the desired relative error $\epsilon$. However, their algorithm only requires a random initialization.

Since the first version of this paper appeared on arXiv, a few recent papers have also studied low-rank
recovery from RIP measurements via Procrustes Flow type schemes [BKS15, ZWL15, CW15]. We would
like to point out that the results presented in these papers are suboptimal compared to ours. For example,
by utilizing some of the results of the previous version of this paper, [BKS15] provides a similar convergence
rate to ours. However, this convergence occurs in a smaller radius around the planted solution so that the
required number of measurements is significantly higher. Furthermore, the results of [BKS15] only apply
when the matrix is PSD and do not work for general rectangular matrices. Similarly, the result in [CW15] holds only
for PSD matrices, and the convergence rate has a high-degree polynomial dependence on the condition number.
The algorithm from [CW15] does generalize to rectangular matrices, but the sample complexity is of the
order of $O(n r^{[\ldots]} \log n)$ rather than the complexity $O(nr)$ we establish here. Moreover, our analysis of both
the PSD and rectangular cases is far more concise.

^3 The authors also propose a stage-wise algorithm with improved sample complexity of $\Omega(n\,[\ldots]\,\tilde{\kappa}^{[\ldots]})$ where $\tilde{\kappa}$ is a local condition
number defined as the ratio of the maximum ratio of two successive eigenvalues. We note, however, that in general $\tilde{\kappa}$ can be as
large as $\kappa$.

5 Proofs 

We first prove our results for the symmetric PSD case (Theorem 3.2). However, whenever possible we will
state lemmas in the more general setting. The changes required for the proof of the general setting (Theorem
3.3) are deferred to Section 5.4.

Recall in this setting that we assume a fixed symmetric PSD $M \in \mathbb{R}^{n \times n}$ of rank $r$, which admits a
factorization $M = XX^T$ for $X \in \mathbb{R}^{n \times r}$. Before we dive into the details of the proofs, we would like to
mention that we will prove our results using the update

$$U_{\tau+1} = U_\tau - \frac{\mu}{\|X\|^2}\nabla f(U_\tau), \qquad (5.1)$$

in lieu of the PF update

$$U_{\tau+1} = U_\tau - \frac{\mu}{\|U_0\|^2}\nabla f(U_\tau). \qquad (5.2)$$

As we prove in Section 5.3, our initial solution obeys $\operatorname{dist}(U_0, X) \le \sigma_r(X)/4$. Hence, applying the triangle
inequality we can conclude that

$$\|U_0\|^2 \le \frac{25}{16}\|X\|^2, \qquad (5.3)$$

and similarly,

$$\|U_0\|^2 \ge \frac{9}{16}\|X\|^2. \qquad (5.4)$$

Thus, any result proven for the update (5.1) will automatically carry over to the PF update with a simple
rescaling of the upper bound on the step size via (5.3). Furthermore, we can upper bound the convergence
rate of gradient descent using the PF update in terms of properties of $X$ instead of $U_0$ via (5.4).

5.1 Preliminaries 

We start with a well known characterization of RIP. 

Lemma 5.1. [Can08] Let $\mathcal{A}$ satisfy $2r$-RIP with constant $\delta_{2r}$. Then, for all matrices $X, Y$ of rank at most
$r$, we have

$$|\langle \mathcal{A}(X), \mathcal{A}(Y) \rangle - \langle X, Y \rangle| \le \delta_{2r}\|X\|_F\|Y\|_F.$$

Next, we state a recent result which characterizes the convergence rate of projected gradient descent onto
general non-convex sets specialized to our problem. See [MJD09] for related results using singular value hard
thresholding. Throughout, $\mathcal{P}_r(M)$ denotes projection onto rank-$r$ matrices. For a symmetric PSD matrix
$M \in \mathbb{R}^{n \times n}$, $\mathcal{P}_r(M)$ denotes projection onto the rank-$r$ PSD matrices and for a rectangular matrix $M \in \mathbb{R}^{n_1 \times n_2}$,
$\mathcal{P}_r(M)$ denotes projection onto rank-$r$ matrices.

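Lemma 5.2. [ORS15] Let $M \in \mathbb{R}^{n_1 \times n_2}$ be an arbitrary matrix of rank $r$. Also let $b = \mathcal{A}(M) \in \mathbb{R}^m$ be $m$
linear measurements. Consider the iterative updates

$$Z_{\tau+1} \leftarrow \mathcal{P}_r\Big(Z_\tau - \frac{1}{m}\sum_{k=1}^m \big(\langle A_k, Z_\tau \rangle - b_k\big) A_k\Big).$$

Then

$$\|Z_\tau - M\|_F \le \rho(\mathcal{A})^\tau \|Z_0 - M\|_F$$

holds. Here, $\rho(\mathcal{A})$ is defined as

$$\rho(\mathcal{A}) := 2 \sup_{\substack{\|X\|_F = 1,\ \operatorname{rank}(X) \le 2r, \\ \|Y\|_F = 1,\ \operatorname{rank}(Y) \le 2r}} |\langle \mathcal{A}(X), \mathcal{A}(Y) \rangle - \langle X, Y \rangle|.$$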


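We shall make repeated use of the following lemma which upper bounds $\|UU^T - XX^T\|_F$ by some
factor of $\operatorname{dist}(U, X)$.

Lemma 5.3. For any $U \in \mathbb{R}^{n \times r}$ obeying $\operatorname{dist}(U, X) \le \frac{1}{4}\|X\|$, we have

$$\|UU^T - XX^T\|_F \le \frac{9}{4}\|X\| \operatorname{dist}(U, X).$$

Proof.

$$\|UU^T - XX^T\|_F = \|U(U - XR)^T + (U - XR)(XR)^T\|_F \le (\|U\| + \|X\|)\|U - XR\|_F \le \frac{9}{4}\|X\|\,\|U - XR\|_F.$$

□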


Finally, we also need the following lemma which upper bounds $\operatorname{dist}(U, X)$ by some factor of $\|UU^T - XX^T\|_F$.
We defer the proof of this result to Appendix A.

Lemma 5.4. For any $U, X \in \mathbb{R}^{n \times r}$, we have

$$\operatorname{dist}^2(U, X) \le \frac{1}{2(\sqrt{2} - 1)\sigma_r^2(X)}\|UU^T - XX^T\|_F^2.$$

We would like to point out that the dependence on $\sigma_r^2(X)$ in the lemma above is unavoidable.
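
A quick numerical spot check can make the inequality relating the Procrustes distance to $\|UU^T - XX^T\|_F$ concrete. The NumPy sketch below evaluates the slack in the bound for a given pair of factors; the function names are ours and the check is illustrative only, since random trials cannot replace the proof in Appendix A.

```python
import numpy as np

def dist(U, X):
    """Procrustes distance min_R ||U - X R||_F over orthogonal R."""
    A, _, Bt = np.linalg.svd(X.T @ U)
    return np.linalg.norm(U - X @ (A @ Bt), 'fro')

def lemma_5_4_gap(U, X):
    """Right-hand side minus left-hand side of the bound; nonnegative when the bound holds."""
    sigma_r = np.linalg.svd(X, compute_uv=False)[-1]      # smallest singular value of X
    rhs = np.linalg.norm(U @ U.T - X @ X.T, 'fro') ** 2 / (2 * (np.sqrt(2) - 1) * sigma_r ** 2)
    return rhs - dist(U, X) ** 2
```

Calling `lemma_5_4_gap` on randomly drawn pairs, both far apart and close together, should never return a value that is negative beyond rounding error.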


5.2 Proof of convergence of gradient descent updates (Equation (3.4)) 

We first outline the general proof strategy. See Sections 2.3 and 7.9 of [CLS15] for related arguments. We first
show that gradient descent on an approximate estimate of the function $f$ converges. The approximate
function we use is $F(U) := \frac{1}{4}\|UU^T - XX^T\|_F^2$. When the map $\mathcal{A}$ is random and isotropic in expectation,
$F(U)$ can be interpreted as the expected value of $f(U)$, but we stress that our result is a purely deterministic
result. We demonstrate that $F(U)$ exhibits geometric convergence in a small neighborhood around $X$. The
standard approach in optimization to show this is to prove that the function exhibits strong convexity.
However, due to the rotational degrees of freedom for any optimal point, it is not possible for $F(U)$ to be
strongly convex in any neighborhood around $X$ except in the special case when $r = 1$. Thus, we rely on the
approach used by [CLS15], which establishes a sufficient condition that only relies on first-order information
along certain trajectories. After showing the sufficient condition holds on $F(U)$, we use standard RIP results
to show that this condition also holds for the function $f(U)$.

To begin our analysis, we start with the following formulas for the gradient of $f(U)$ and $F(U)$:

$$\nabla f(U) = \sum_{k=1}^m \langle A_k, UU^T - XX^T \rangle A_k U = \mathcal{A}^*\mathcal{A}(UU^T - XX^T) \cdot U, \qquad \nabla F(U) = (UU^T - XX^T)U.$$

Above, $\mathcal{A}^* : \mathbb{R}^m \to \mathbb{R}^{n \times n}$ is the adjoint operator of $\mathcal{A}$, i.e. $\mathcal{A}^*(z) = \sum_{k=1}^m z_k A_k$. Throughout the proof $R$ is
the solution to the orthogonal Procrustes problem. That is,

$$R = \arg\min_{\tilde{R} \in \mathbb{R}^{r \times r}:\ \tilde{R}^T \tilde{R} = I_r} \|U - X\tilde{R}\|_F,$$

with the dependence on $U$ omitted for sake of exposition. The following definition defines a notion of strong
convexity along certain trajectories of the function.

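Definition 5.5. (Regularity condition, [CLS15]) Let $X \in \mathbb{R}^{n \times r}$ be a global optimum of a function $f$. Define
the set $B(\delta)$ as

$$B(\delta) := \{U \in \mathbb{R}^{n \times r} : \operatorname{dist}(U, X) \le \delta\}.$$

The function $f$ satisfies a regularity condition, denoted by $RC(\alpha, \beta, \delta)$, if for all matrices $U \in B(\delta)$ the
following inequality holds:

$$\langle \nabla f(U), U - XR \rangle \ge \frac{1}{\alpha}\|U - XR\|_F^2 + \frac{1}{\beta}\|\nabla f(U)\|_F^2.$$

If a function satisfies $RC(\alpha, \beta, \delta)$, then as long as gradient descent starts from a point $U_0 \in B(\delta)$, it will
have a geometric rate of convergence to the optimum $X$. This is formalized by the following lemma.

Lemma 5.6. [CLS15] If $f$ satisfies $RC(\alpha, \beta, \delta)$ and $U_0 \in B(\delta)$, then the gradient descent update

$$U_{\tau+1} \leftarrow U_\tau - \mu\nabla f(U_\tau),$$

with step size $0 < \mu \le 2/\beta$ obeys $U_\tau \in B(\delta)$ and

$$\operatorname{dist}^2(U_\tau, X) \le \Big(1 - \frac{2\mu}{\alpha}\Big)^\tau \operatorname{dist}^2(U_0, X),$$

for all $\tau \ge 0$.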
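
To see why the regularity condition yields a contraction, it may help to write out one step of the standard argument. The following is a sketch of that reasoning in our own notation, not a quotation from [CLS15]: $R_\tau$ denotes the Procrustes rotation for $U_\tau$, and $H_\tau := U_\tau - XR_\tau$ is shorthand introduced here, so that $\|H_\tau\|_F = \operatorname{dist}(U_\tau, X)$.

```latex
\begin{align*}
\operatorname{dist}^2(U_{\tau+1}, X)
  &\le \|U_\tau - \mu\nabla f(U_\tau) - XR_\tau\|_F^2 \\
  &= \|H_\tau\|_F^2 - 2\mu\,\langle \nabla f(U_\tau), H_\tau \rangle + \mu^2\|\nabla f(U_\tau)\|_F^2 \\
  &\le \Big(1 - \frac{2\mu}{\alpha}\Big)\|H_\tau\|_F^2 + \mu\Big(\mu - \frac{2}{\beta}\Big)\|\nabla f(U_\tau)\|_F^2 \\
  &\le \Big(1 - \frac{2\mu}{\alpha}\Big)\operatorname{dist}^2(U_\tau, X).
\end{align*}
```

The first line uses that $R_\tau$ is feasible but not necessarily optimal for $U_{\tau+1}$, the third line applies the regularity condition, and the last line uses $\mu \le 2/\beta$. Iterating the bound gives the geometric rate and also shows that the iterates never leave $B(\delta)$.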

The proof is complete by showing that the regularity condition holds. To this end, we first show in
Lemma 5.7 below that the function $F(U)$ satisfies a slightly stronger variant of the regularity condition from
Definition 5.5. We then show in Lemma 5.8 that the gradient of $f$ is always close to the gradient of $F$, and
in Lemma 5.9 that the gradient of $f$ is Lipschitz around the optimal value $X$.

Lemma 5.7. Let $F(U) = \frac{1}{4}\|UU^T - XX^T\|_F^2$. For all $U$ obeying

$$\|U - XR\| \le \frac{1}{4}\sigma_r(X),$$

we have

$$\langle \nabla F(U), U - XR \rangle - \frac{1}{20}\big(\|UU^T - XX^T\|_F^2 + \|(U - XR)U^T\|_F^2\big) \ge \frac{\sigma_r^2(X)}{4}\|U - XR\|_F^2 + \frac{1}{5}\|UU^T - XX^T\|_F^2. \qquad (5.5)$$

Lemma 5.8. Let $\mathcal{A}$ be a linear map obeying rank-$4r$ RIP with constant $\delta_{4r}$. For any $H \in \mathbb{R}^{n \times r}$ and any
$U \in \mathbb{R}^{n \times r}$ obeying $\operatorname{dist}(U, X) \le \frac{1}{4}\|X\|$, we have

$$|\langle \nabla F(U) - \nabla f(U), H \rangle| \le \delta_{4r}\|UU^T - XX^T\|_F\|HU^T\|_F.$$

This immediately implies that for any $U \in \mathbb{R}^{n \times r}$ obeying $\operatorname{dist}(U, X) \le \frac{1}{4}\|X\|$, we have

$$\|\nabla f(U) - \nabla F(U)\|_F \le \delta_{4r}\|UU^T - XX^T\|_F\|U\|.$$

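Lemma 5.9. Let $\mathcal{A}$ be a linear map obeying rank-$6r$ RIP with constant $\delta_{6r}$. Suppose that $\delta_{6r} \le 1/10$. Then
for all $U \in \mathbb{R}^{n \times r}$, we have that

$$\|UU^T - XX^T\|_F^2 \ge \frac{10}{17}\frac{1}{\|U\|^2}\|\nabla f(U)\|_F^2.$$

We shall prove these three lemmas in Sections 5.2.1, 5.2.2, and 5.2.3. However, we first explain how the
regularity condition follows from these three lemmas. To begin, note that

$$\begin{aligned}
\langle \nabla F(U), U - XR \rangle &= \langle \nabla f(U), U - XR \rangle + \langle \nabla F(U) - \nabla f(U), U - XR \rangle \\
&\overset{(a)}{\le} \langle \nabla f(U), U - XR \rangle + \frac{1}{10}\|UU^T - XX^T\|_F\|(U - XR)U^T\|_F \\
&\overset{(b)}{\le} \langle \nabla f(U), U - XR \rangle + \frac{1}{20}\big(\|UU^T - XX^T\|_F^2 + \|(U - XR)U^T\|_F^2\big), \qquad (5.6)
\end{aligned}$$

where (a) holds from Cauchy-Schwarz followed by Lemma 5.8, using the fact that $\delta_{6r} \le \frac{1}{10}$ as assumed in
the statement of Theorem 3.2, and (b) follows from $2ab \le a^2 + b^2$.

Combining (5.6) with Lemma 5.7 for any $U$ obeying $\|U - XR\| \le \frac{1}{4}\sigma_r(X)$, we have

$$\begin{aligned}
\langle \nabla f(U), U - XR \rangle &\ge \frac{\sigma_r^2(X)}{4}\|U - XR\|_F^2 + \frac{1}{5}\|UU^T - XX^T\|_F^2 \\
&\overset{(a)}{\ge} \frac{\sigma_r^2(X)}{4}\|U - XR\|_F^2 + \frac{2}{17}\frac{1}{\|U\|^2}\|\nabla f(U)\|_F^2 \\
&\overset{(b)}{\ge} \frac{\sigma_r^2(X)}{4}\|U - XR\|_F^2 + \frac{32}{425}\frac{1}{\|X\|^2}\|\nabla f(U)\|_F^2, \qquad (5.7)
\end{aligned}$$

where (a) follows from Lemma 5.9 and (b) follows from the fact that $\|U\| \le \frac{5}{4}\|X\|$ when $\operatorname{dist}(U, X) \le
\frac{1}{4}\|X\|$. Equation (5.7) shows that $f(U)$ obeys $RC(4/\sigma_r^2(X), \frac{425}{32}\|X\|^2, \frac{1}{4}\sigma_r(X))$. The convergence result in
Equation (3.4) now follows from Lemma 5.6. All that remains is to prove Lemmas 5.7, 5.8, and 5.9.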
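
The step-size bound and the rate constant in Theorem 3.2 can be traced through the rescaling between (5.1) and (5.2). The arithmetic below is our own sketch and rests on an assumption: that the regularity parameters are $\alpha = 4/\sigma_r^2(X)$ and $\beta = \frac{425}{32}\|X\|^2$, which is our reading of (5.7). It uses (5.3), (5.4), and the identity $\sigma_r^2(X)/\|X\|^2 = \sigma_r(M)/\sigma_1(M) = 1/\kappa$.

```latex
\begin{align*}
\text{Step size:}\quad
  \frac{\mu}{\|U_0\|^2}
  \;\le\; \frac{16}{9}\,\frac{\mu}{\|X\|^2}
  \;\le\; \frac{2}{\beta} = \frac{64}{425\,\|X\|^2}
  \quad&\text{whenever}\quad \mu \le \frac{36}{425}, \\[4pt]
\text{Rate:}\quad
  1 - \frac{2}{\alpha}\cdot\frac{\mu}{\|U_0\|^2}
  \;\le\; 1 - \frac{\sigma_r^2(X)}{2}\cdot\frac{16}{25}\,\frac{\mu}{\|X\|^2}
  \;&=\; 1 - \frac{8}{25}\,\frac{\mu}{\kappa}.
\end{align*}
```

Taking square roots in Lemma 5.6 and using $\operatorname{dist}(U_0, X) \le \frac{1}{4}\sigma_r(X)$ then gives a bound of the form stated in (3.4).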

5.2.1 Proof of the regularity condition for the function F (Lemma 5.7) 

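We first state some properties of the Procrustes problem and its optimal solution. Let $U, X \in \mathbb{R}^{n \times r}$ and
define $H := U - XR$, where $R$ is the orthogonal matrix which minimizes $\|U - XR\|_F$. Let $A \Sigma B^T$ be the
SVD of $X^T U$; we know that the optimal $R$ is $R = AB^T$. Thus,

$$U^T X R = B \Sigma B^T = (XR)^T U,$$

which shows that $U^T X R$ is a symmetric PSD matrix. Furthermore, note that since

$$H^T X R = U^T X R - R^T X^T X R = (XR)^T U - R^T X^T X R = (XR)^T (U - XR) = (XR)^T H,$$

we can conclude that $H^T X R$ is symmetric. To avoid carrying $R$ in our equations we perform the change
of variable $X \leftarrow XR$. That is, without loss of generality we assume $R = I$ and that $U^T X \succeq 0$ and
$H^T X = X^T H$.

Note that for any $U$ obeying $\operatorname{dist}(U, X) \le \frac{1}{4}\|X\|$ we have

$$\|(UU^T - XX^T)U\|_F \le \|UU^T - XX^T\|_F\|U\| \le \frac{5}{4}\|X\|\,\|UU^T - XX^T\|_F.$$

Using the latter along with the simplifications discussed above, to prove (5.5) it suffices to prove

$$\langle (UU^T - XX^T)U, U - X \rangle - \frac{1}{20}\big(\|UU^T - XX^T\|_F^2 + \|(U - X)U^T\|_F^2\big) \ge \frac{\sigma_r^2(X)}{4}\|U - X\|_F^2 + \frac{1}{5}\|UU^T - XX^T\|_F^2. \qquad (5.8)$$

Equation (5.8) can equivalently be written in the form

$$\begin{aligned}
0 \le \operatorname{Tr}\Big( &(H^T H)^2 + 3H^T H H^T X + (H^T X)^2 + H^T H X^T X \\
&- \Big(\frac{1}{20} + \frac{1}{5}\Big)\big[(H^T H)^2 + 4H^T H H^T X + 2(H^T X)^2 + 2H^T H X^T X\big] \\
&- \frac{1}{20}\big[(H^T H)^2 + 2H^T H H^T X + H^T H X^T X\big] - \frac{\sigma_r^2(X)}{4}H^T H \Big).
\end{aligned}$$

Rearranging terms, we arrive at

$$\begin{aligned}
0 &\le \operatorname{Tr}\Big( c_1 (H^T H)^2 + c_2 H^T H H^T X + c_3 (H^T X)^2 + c_4 H^T H X^T X - \frac{\sigma_r^2(X)}{4}H^T H \Big) \\
&= \operatorname{Tr}\Big( \Big(\frac{c_2}{2\sqrt{c_3}}H^T H + \sqrt{c_3}\,H^T X\Big)^2 + \Big(c_1 - \frac{c_2^2}{4c_3}\Big)(H^T H)^2 + c_4 H^T H X^T X - \frac{\sigma_r^2(X)}{4}H^T H \Big). \qquad (5.9)
\end{aligned}$$

Here the constants $c_1, c_2, c_3, c_4$ are defined as

$$c_1 = \frac{7}{10}, \quad c_2 = \frac{19}{10}, \quad c_3 = \frac{1}{2}, \quad c_4 = \frac{9}{20}.$$

Since $c_1 < c_2^2/(4c_3)$, to prove (5.9) it thus suffices to require that

$$\|H\|^2 \le \frac{c_4 - 1/4}{c_2^2/(4c_3) - c_1}\,\sigma_r^2(X) = \frac{40}{221}\sigma_r^2(X).$$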


5.2.2 Proof of gradient concentration (Lemma 5.8) 

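Define $\Delta := UU^T - XX^T$. Then,

$$|\langle \nabla f(U) - \nabla F(U), H \rangle| = |\langle \mathcal{A}(\Delta), \mathcal{A}(HU^T) \rangle - \langle \Delta, HU^T \rangle| \overset{(a)}{\le} \delta_{4r}\|UU^T - XX^T\|_F\|HU^T\|_F,$$

where (a) follows from Lemma 5.1, since $\operatorname{rank}(\Delta) \le 2r$ and $\operatorname{rank}(HU^T) \le r$. This proves the first part of
the lemma. To prove the second part, by the variational form of the Frobenius norm, we have

$$\|\nabla f(U) - \nabla F(U)\|_F = \sup_{H \in \mathbb{R}^{n \times r},\ \|H\|_F \le 1} \langle \nabla f(U) - \nabla F(U), H \rangle \le \delta_{4r}\|UU^T - XX^T\|_F \sup_{H \in \mathbb{R}^{n \times r},\ \|H\|_F \le 1} \|HU^T\|_F.$$

The result now follows from $\|HU^T\|_F \le \|H\|_F\|U\| \le \|U\|$.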

5.2.3 Proof of Lipschitz gradients aronnd optimal solntion (Lemma 5.9) 

Define A := UU~^ — XX^ and <5 := S^r- Suppose we show that 

\\AiA)\\l-Y\yf{U)\\l>^\\A\\l , (5.10) 

where 7 := l/2||ll/|p. Then by 2r-RIP we have 

i ^ < (1 + <^) ll^ll^ ^ + A) ’ 
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which yields the claim after rearranging. 

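We now focus on proving Equation (5.10). Recall $\nabla f(U) = \mathcal{A}^*\mathcal{A}(\Delta) \cdot U$, which implies $\|\nabla f(U)\|_F^2 = \langle \mathcal{A}(\Delta), \mathcal{A}(\mathcal{A}^*\mathcal{A}(\Delta) \cdot UU^T) \rangle$. Using this equality we have

$$
\begin{aligned}
\|\mathcal{A}(\Delta)\|_{\ell_2}^2 - \gamma \|\nabla f(U)\|_F^2
&= \langle \mathcal{A}(\Delta), \mathcal{A}(\Delta) - \gamma \mathcal{A}(\mathcal{A}^*\mathcal{A}(\Delta) \cdot UU^T) \rangle \\
&= \langle \mathcal{A}(\Delta), \mathcal{A}(\Delta - \gamma \mathcal{A}^*\mathcal{A}(\Delta) \cdot UU^T) \rangle \\
&\overset{(a)}{\ge} \langle \Delta, \Delta - \gamma \mathcal{A}^*\mathcal{A}(\Delta) \cdot UU^T \rangle - \delta \|\Delta\|_F \|\Delta - \gamma \mathcal{A}^*\mathcal{A}(\Delta) \cdot UU^T\|_F \\
&= \langle \Delta, \Delta \rangle - \gamma \langle \Delta, \mathcal{A}^*\mathcal{A}(\Delta) \cdot UU^T \rangle - \delta \|\Delta\|_F \|\Delta - \gamma \mathcal{A}^*\mathcal{A}(\Delta) \cdot UU^T\|_F \\
&= \langle \Delta, \Delta \rangle - \gamma \langle \mathcal{A}(\Delta UU^T), \mathcal{A}(\Delta) \rangle - \delta \|\Delta\|_F \|\Delta - \gamma \mathcal{A}^*\mathcal{A}(\Delta) \cdot UU^T\|_F \\
&\overset{(b)}{\ge} \langle \Delta, \Delta \rangle - \gamma \langle \Delta UU^T, \Delta \rangle - \gamma \delta \|\Delta UU^T\|_F \|\Delta\|_F - \delta \|\Delta\|_F \|\Delta - \gamma \mathcal{A}^*\mathcal{A}(\Delta) \cdot UU^T\|_F,
\end{aligned}
\tag{5.11}
$$

where both (a) and (b) hold by Lemma 5.1 since $\operatorname{rank}(\Delta) \le 2r$, $\operatorname{rank}(\Delta - \gamma \mathcal{A}^*\mathcal{A}(\Delta) \cdot UU^T) \le 3r$, and $\operatorname{rank}(\Delta UU^T) \le r$. We now control $\|\Delta - \gamma \mathcal{A}^*\mathcal{A}(\Delta) \cdot UU^T\|_F$ from above. Using the variational form of the Frobenius norm,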

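$$
\begin{aligned}
\|\Delta - \gamma \mathcal{A}^*\mathcal{A}(\Delta) \cdot UU^T\|_F
&= \sup_{V \in \mathbb{R}^{n \times n},\, \|V\|_F \le 1} \langle \Delta - \gamma \mathcal{A}^*\mathcal{A}(\Delta) \cdot UU^T, V \rangle \\
&= \sup_{V \in \mathbb{R}^{n \times n},\, \|V\|_F \le 1} \langle \Delta, V \rangle - \gamma \langle \mathcal{A}^*\mathcal{A}(\Delta) \cdot UU^T, V \rangle \\
&= \sup_{V \in \mathbb{R}^{n \times n},\, \|V\|_F \le 1} \langle \Delta, V \rangle - \gamma \langle \mathcal{A}^*\mathcal{A}(\Delta), VUU^T \rangle \\
&= \sup_{V \in \mathbb{R}^{n \times n},\, \|V\|_F \le 1} \langle \Delta, V \rangle - \gamma \langle \mathcal{A}(\Delta), \mathcal{A}(VUU^T) \rangle \\
&\overset{(a)}{\le} \sup_{V \in \mathbb{R}^{n \times n},\, \|V\|_F \le 1} \langle \Delta, V \rangle - \gamma \langle \Delta, VUU^T \rangle + \gamma \delta \|\Delta\|_F \|VUU^T\|_F \\
&= \sup_{V \in \mathbb{R}^{n \times n},\, \|V\|_F \le 1} \langle \Delta (I - \gamma UU^T), V \rangle + \gamma \delta \|\Delta\|_F \|VUU^T\|_F \\
&\overset{(b)}{\le} \|\Delta - \gamma \Delta UU^T\|_F + \gamma \delta \|\Delta\|_F \|UU^T\|,
\end{aligned}
$$

where (a) holds again by Lemma 5.1 since $\operatorname{rank}(VUU^T) \le r$, and (b) holds since $\|VUU^T\|_F \le \|V\|_F \|UU^T\| \le \|UU^T\|$. Plugging this upper bound into Equation (5.11), we have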

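$$
\begin{aligned}
\|\mathcal{A}(\Delta)\|_{\ell_2}^2 - \gamma \|\nabla f(U)\|_F^2
&\ge \langle \Delta, \Delta \rangle - \gamma \langle \Delta UU^T, \Delta \rangle - \gamma \delta \|\Delta UU^T\|_F \|\Delta\|_F - \delta \|\Delta\|_F \|\Delta - \gamma \Delta UU^T\|_F - \gamma \delta^2 \|\Delta\|_F^2 \|UU^T\| \\
&\overset{(a)}{\ge} \langle \Delta, \Delta \rangle - \gamma \|UU^T\| \langle \Delta, \Delta \rangle - \gamma \delta \|\Delta\|_F^2 \|UU^T\| - \delta \|\Delta\|_F^2 \|I - \gamma UU^T\| - \gamma \delta^2 \|\Delta\|_F^2 \|UU^T\|,
\end{aligned}
\tag{5.12}
$$

where (a) holds since $\langle \Delta UU^T, \Delta \rangle \le \|U\|^2 \langle \Delta, \Delta \rangle$. Using the fact that $\delta \le 1/10$, (5.12) implies

$$
\|\mathcal{A}(\Delta)\|_{\ell_2}^2 - \gamma \|\nabla f(U)\|_F^2 \ge \Big( 1 - \gamma \sigma_1^2(U) - \frac{\gamma}{10} \sigma_1^2(U) - \frac{\gamma}{100} \sigma_1^2(U) - \frac{1}{10} \|I - \gamma UU^T\| \Big) \|\Delta\|_F^2. \tag{5.13}
$$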

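For $\gamma = \frac{1}{2\|U\|^2}$, we have $\gamma \sigma_1^2(U) = \frac{1}{2}$, so the eigenvalues of $\gamma UU^T$ lie in $[0, 1/2]$ and hence

$$
\|I - \gamma UU^T\| \le 1.
$$

Plugging this bound into Equation (5.13), we get

$$
\|\mathcal{A}(\Delta)\|_{\ell_2}^2 - \frac{1}{2\|U\|^2} \|\nabla f(U)\|_F^2 \ge \Big( 1 - \frac{1}{2} - \frac{1}{20} - \frac{1}{200} - \frac{1}{10} \Big) \|\Delta\|_F^2 \ge \frac{1}{4} \|\Delta\|_F^2.
$$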
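The numerical step from (5.13) to (5.10) is a short piece of arithmetic that can be checked exactly. The sketch below assumes $\delta = 1/10$, $\gamma\sigma_1^2(U) = 1/2$, the bound $\|I - \gamma UU^T\| \le 1$, and that (5.10) is stated with the constant $1/4$; all variable names are illustrative.

```python
from fractions import Fraction as Fr

delta = Fr(1, 10)            # worst-case RIP constant, delta <= 1/10
gamma_sigma1_sq = Fr(1, 2)   # gamma * sigma_1(U)^2 for gamma = 1 / (2 ||U||^2)
op_norm_bound = Fr(1)        # upper bound on ||I - gamma U U^T||

# Coefficient of ||Delta||_F^2 on the right-hand side of (5.13).
lower = (1 - gamma_sigma1_sq
         - delta * gamma_sigma1_sq
         - delta**2 * gamma_sigma1_sq
         - delta * op_norm_bound)

# (5.10) asks for this coefficient to be at least 1/4 (assumed constant).
target = Fr(1, 4)

# Rearranging (1/4)||D||^2 + gamma ||grad f||^2 <= (1 + delta)||D||^2 gives
# ||D||_F^2 >= lipschitz_const * ||grad f||_F^2 / ||U||^2.
lipschitz_const = 1 / (2 * (1 + delta - target))
```

Under these assumptions the coefficient evaluates to $69/200$, which exceeds $1/4$, and the rearranged bound carries the constant $10/17$.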


5.3 Proof of initialization (Equation (3.3)) 

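Using Lemma 5.1, we can conclude that $\rho(\mathcal{A})$ from Lemma 5.2 is bounded by $\rho(\mathcal{A}) \le 2\delta_{4r} \le 1/5$. Setting $M_0 = 0_{n \times n}$ and applying Lemma 5.2 to our initialization iterates, we have that

$$
\|M_\tau - XX^T\|_F \le (1/5)^\tau \|XX^T\|_F \le (1/5)^\tau \|X\| \|X\|_F.
$$

From Lemma 5.4, we have that

$$
\operatorname{dist}(U_0, X) \le \frac{1}{\sqrt{2(\sqrt{2} - 1)}} \frac{1}{\sigma_r(X)} \|M_\tau - XX^T\|_F \le \sqrt{2}\, (1/5)^\tau \sqrt{\kappa}\, \|X\|_F.
$$

Hence, if we want the RHS to be upper bounded by $\frac{1}{4}\sigma_r(X)$ we require

$$
\sqrt{2}\, (1/5)^\tau \sqrt{\kappa}\, \|X\|_F \le \frac{1}{4} \sigma_r(X) \iff (1/5)^\tau \le \frac{\sigma_r(X)}{4 \sqrt{2} \sqrt{\kappa}\, \|X\|_F}.
$$

Since $\|X\|_F \le \sqrt{r}\, \|X\|$, it is enough to require that

$$
\tau \ge \log(\sqrt{r}\, \kappa) + 2. \tag{5.14}
$$

Similarly, it is easy to check that if $\tau$ satisfies (5.14), then

$$
\|M_\tau - XX^T\| \le \|M_\tau - XX^T\|_F \le \sigma_r^2(X)/4
$$

also holds.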
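To see how mild requirement (5.14) is in practice, the following sketch computes the smallest integer $\tau$ that satisfies it and checks the contraction condition it is meant to guarantee. It assumes a natural logarithm, $\kappa = \|X\|^2 / \sigma_r^2(X)$, and the worst case $\|X\|_F = \sqrt{r}\,\|X\|$; the function names are hypothetical.

```python
import math

def init_iterations(r, kappa):
    """Smallest integer tau with tau >= log(sqrt(r) * kappa) + 2, as in (5.14)."""
    return math.ceil(math.log(math.sqrt(r) * kappa) + 2)

def contraction_ok(tau, r, kappa):
    """Check sqrt(2) * (1/5)^tau * sqrt(kappa) * ||X||_F <= sigma_r(X) / 4.

    In the worst case ||X||_F = sqrt(r) * ||X||, dividing both sides by
    sigma_r(X) turns the condition into
    sqrt(2) * 5^(-tau) * sqrt(r) * kappa <= 1/4.
    """
    return math.sqrt(2) * 5.0 ** (-tau) * math.sqrt(r) * kappa <= 0.25

# Example: rank 5 and condition number 10.
tau_example = init_iterations(5, 10.0)
ok_example = contraction_ok(tau_example, 5, 10.0)
```

The iteration count grows only logarithmically in $\sqrt{r}\,\kappa$, which is why the initialization phase is cheap relative to the gradient phase.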


5.4 Proof for rectangular matrices (Theorem 3.3) 

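We now turn our attention to the general case where the matrices are rectangular. Recall that in this case, we want to recover a fixed but unknown rank-$r$ matrix $M \in \mathbb{R}^{n_1 \times n_2}$ from linear measurements. Assume that $M$ has a singular value decomposition of the form $M = A \Sigma B^T$. Define $X = A \Sigma^{1/2} \in \mathbb{R}^{n_1 \times r}$ and $Y = B \Sigma^{1/2} \in \mathbb{R}^{n_2 \times r}$. With this piece of notation the iterates $U_\tau \in \mathbb{R}^{n_1 \times r}$, $V_\tau \in \mathbb{R}^{n_2 \times r}$ in Algorithm 2 can be thought of as estimates of $X$ and $Y$. The proof of the correctness of the initialization phase of Procrustes Flow (Theorem 3.3, Equation (3.6)) in the rectangular case is similar to the PSD case (Theorem 3.2, Equation (3.3)) and is detailed in Section 5.4.3. In this section we shall focus on proving the convergence guarantee provided in Theorem 3.3, Equation (3.7).

To simplify exposition we aggregate the pairs of matrices $(U, V)$, $(U_\tau, V_\tau)$, $(X, Y)$, and $(X, -Y)$ into larger "lifted" matrices as follows

$$
W := \begin{bmatrix} U \\ V \end{bmatrix}, \qquad
W_\tau := \begin{bmatrix} U_\tau \\ V_\tau \end{bmatrix}, \qquad
Z := \begin{bmatrix} X \\ Y \end{bmatrix}, \qquad \text{and} \qquad
\tilde{Z} := \begin{bmatrix} X \\ -Y \end{bmatrix}.
$$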
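Before we continue further we first record a few simple facts about these new variables which we will utilize multiple times in the sequel. First, note that for $i = 1, 2, \ldots, r$, we have $\sigma_i^2(Z) = \sigma_i^2(\tilde{Z}) = 2\sigma_i(M)$. Also, since $X^T X = Y^T Y = \Sigma$, we have $Z^T \tilde{Z} = \tilde{Z}^T Z = 0_{r \times r}$.

To prove Theorem 3.3, Equation (3.7), we will demonstrate that the function $g(W) := g(U, V)$ over the variable $W$ has similar form to $f(U)$ over the variable $U$. To see this connection clearly, we need a few useful block matrix operators and definitions. Let $\operatorname{Sym} : \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^{(n_1 + n_2) \times (n_1 + n_2)}$ be defined as

$$
\operatorname{Sym}(A) := \begin{bmatrix} 0_{n_1 \times n_1} & A \\ A^T & 0_{n_2 \times n_2} \end{bmatrix}.
$$

We note for future use that with this notation we have $\operatorname{Sym}(M) = \frac{1}{2}(ZZ^T - \tilde{Z}\tilde{Z}^T)$. Given a block matrix $A \in \mathbb{R}^{(n_1 + n_2) \times (n_1 + n_2)}$ partitioned as

$$
A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix},
$$

with $A_{11} \in \mathbb{R}^{n_1 \times n_1}$, $A_{12} \in \mathbb{R}^{n_1 \times n_2}$, $A_{21} \in \mathbb{R}^{n_2 \times n_1}$, and $A_{22} \in \mathbb{R}^{n_2 \times n_2}$, we define the linear operators $\mathcal{P}_{\mathrm{diag}}$ and $\mathcal{P}_{\mathrm{off}}$ from $\mathbb{R}^{(n_1 + n_2) \times (n_1 + n_2)} \to \mathbb{R}^{(n_1 + n_2) \times (n_1 + n_2)}$ as follows

$$
\mathcal{P}_{\mathrm{diag}}(A) := \begin{bmatrix} A_{11} & 0_{n_1 \times n_2} \\ 0_{n_2 \times n_1} & A_{22} \end{bmatrix}, \qquad
\mathcal{P}_{\mathrm{off}}(A) := \begin{bmatrix} 0_{n_1 \times n_1} & A_{12} \\ A_{21} & 0_{n_2 \times n_2} \end{bmatrix}.
$$

Our final piece of notation is an augmented measurement map which works over lifted matrices, which we call $\mathcal{B}$. The map $\mathcal{B} : \mathbb{R}^{(n_1 + n_2) \times (n_1 + n_2)} \to \mathbb{R}^m$ is defined as

$$
\mathcal{B}(X)_k := \langle B_k, X \rangle, \qquad B_k := \operatorname{Sym}(A_k).
$$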
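The identities recorded for the lifted variables can be verified on a tiny exact example. The sketch below uses plain Python lists with rational entries; the helpers `matmul`, `transpose`, `sub`, `sym`, `p_diag`, and `p_off` are hypothetical names introduced only for this illustration, and the rank-1 instance is an assumption chosen so that $X^T X = Y^T Y$ holds exactly.

```python
from fractions import Fraction as Fr

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sym(A):
    """Sym(A) = [[0, A], [A^T, 0]] for an n1 x n2 matrix A."""
    n1, n2 = len(A), len(A[0])
    top = [[0] * n1 + list(row) for row in A]
    bottom = [list(row) + [0] * n2 for row in transpose(A)]
    return top + bottom

def p_diag(A, n1):
    """Keep the two diagonal blocks (leading block of size n1), zero the rest."""
    return [[a if (i < n1) == (j < n1) else 0 for j, a in enumerate(row)]
            for i, row in enumerate(A)]

def p_off(A, n1):
    """Keep the two off-diagonal blocks, zero the diagonal blocks."""
    return [[a if (i < n1) != (j < n1) else 0 for j, a in enumerate(row)]
            for i, row in enumerate(A)]

# Rank-1 instance with X^T X = Y^T Y = [4], so n1 = 2, n2 = 1, r = 1.
X = [[Fr(6, 5)], [Fr(8, 5)]]
Y = [[Fr(2)]]
M = matmul(X, transpose(Y))                    # M = X Y^T
Z = X + Y                                      # stacked [X; Y]
Zt = X + [[-y for y in row] for row in Y]      # stacked [X; -Y]

ZZt = matmul(Z, transpose(Z))
ZtZtt = matmul(Zt, transpose(Zt))

orthogonal = matmul(transpose(Z), Zt)          # should be the 1 x 1 zero matrix
half_diff = [[v / 2 for v in row] for row in sub(ZZt, ZtZtt)]
flipped = sub(p_diag(ZZt, 2), p_off(ZZt, 2))   # (P_diag - P_off)(Z Z^T)
```

On this instance `orthogonal` is zero, `half_diff` equals `sym(M)`, and `flipped` equals $\tilde{Z}\tilde{Z}^T$, matching $Z^T\tilde{Z} = 0$, $\operatorname{Sym}(M) = \frac{1}{2}(ZZ^T - \tilde{Z}\tilde{Z}^T)$, and $\tilde{Z}\tilde{Z}^T = (\mathcal{P}_{\mathrm{diag}} - \mathcal{P}_{\mathrm{off}})(ZZ^T)$.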


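In this lifted space the function $g$ takes the form

$$
\begin{aligned}
g(W) := g(U, V) &= \frac{1}{2} \|\mathcal{A}(UV^T) - b\|_{\ell_2}^2 + \frac{1}{16} \|U^T U - V^T V\|_F^2 \\
&= \frac{1}{8} \|\mathcal{B}(\operatorname{Sym}(UV^T) - \operatorname{Sym}(M))\|_{\ell_2}^2 + \frac{1}{16} \|U^T U - V^T V\|_F^2.
\end{aligned}
$$

Note that the updates of the Procrustes Flow algorithm in Equations (2.5) and (2.6) are based on the gradients $\nabla_U g(U, V)$ and $\nabla_V g(U, V)$ given by

$$
\begin{aligned}
\nabla_U g(U, V) &= \sum_{k=1}^m \langle A_k, UV^T - M \rangle A_k V + \frac{1}{4} U (U^T U - V^T V), \\
\nabla_V g(U, V) &= \sum_{k=1}^m \langle A_k, UV^T - M \rangle A_k^T U + \frac{1}{4} V (V^T V - U^T U).
\end{aligned}
$$

One can easily verify that this update has the following compact representation in terms of the lifted space

$$
\nabla g(W) = \begin{bmatrix} \nabla_U g(U, V) \\ \nabla_V g(U, V) \end{bmatrix}
= \frac{1}{2} \mathcal{B}^*\mathcal{B}(WW^T - \operatorname{Sym}(M)) W + \frac{1}{4} (\mathcal{P}_{\mathrm{diag}} - \mathcal{P}_{\mathrm{off}})(WW^T) W.
$$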


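As in the proof for the PSD case, the crux of Theorem 3.3 lies in establishing that the regularity condition

$$
\langle \nabla g(W), W - ZR \rangle \ge \frac{\sigma_r(M)}{8} \|W - ZR\|_F^2 + \frac{16}{1683 \|M\|} \|\nabla g(W)\|_F^2 \tag{5.15}
$$

holds for all $W \in \mathbb{R}^{(n_1 + n_2) \times r}$ obeying $\operatorname{dist}(W, Z) \le \frac{1}{2\sqrt{2}} \sigma_r^{1/2}(M)$. Assuming that this condition holds, we have that $g(W)$ obeys $RC\big(8/\sigma_r(M),\ \frac{1683}{16}\|M\|,\ \frac{1}{2\sqrt{2}} \sigma_r^{1/2}(M)\big)$, and hence Theorem 3.3, Equation (3.7) immediately follows by appealing to Lemma 5.6.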

To prove (5.15), we make use of the similarity of the expressions with the PSD case. We start, as before, by defining a reference function $F(W) := \frac{1}{4} \|WW^T - ZZ^T\|_F^2$, with gradient $\nabla F(W) = (WW^T - ZZ^T) W$. We now state two lemmas relating $g$ and $F$, which together immediately imply (5.15). The first lemma relates the regularity condition of $g$ to that of $F$ by utilizing RIP. The second lemma provides a Lipschitz type property for the gradient of $g$.

Lemma 5.10. Assume the linear mapping $\mathcal{A}$ obeys 4r-RIP with constant $\delta_{4r}$. Then $g$ obeys the following regularity condition for any $W \in \mathbb{R}^{(n_1 + n_2) \times r}$ and $R \in \mathbb{R}^{r \times r}$:

$$
\langle \nabla g(W), W - ZR \rangle \ge -\frac{\delta_{4r}}{2} \|WW^T - ZZ^T\|_F \|(W - ZR) W^T\|_F + \frac{1}{4} \langle \nabla F(W), W - ZR \rangle + \frac{1}{8\|M\|} \|\tilde{Z}\tilde{Z}^T W\|_F^2. \tag{5.16}
$$

Lemma 5.11. Let $\mathcal{A}$ be a linear map obeying rank-6r RIP with constant $\delta_{6r} \le 1/10$. Then for all $W \in \mathbb{R}^{(n_1 + n_2) \times r}$ satisfying $\operatorname{dist}(W, Z) \le \frac{1}{4} \|Z\|$, we have that

$$
\frac{21}{400} \|WW^T - ZZ^T\|_F^2 + \frac{1}{8\|M\|} \|\tilde{Z}\tilde{Z}^T W\|_F^2 \ge \frac{16}{1683} \frac{1}{\|M\|} \|\nabla g(W)\|_F^2. \tag{5.17}
$$


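With these lemmas in place we have all the elements to prove (5.15). We use Lemma 5.10, Equation (5.16) together with the inequality $2ab \le a^2 + b^2$ to conclude that

$$
\langle \nabla g(W), W - ZR \rangle \ge -\frac{\delta_{4r}}{4} \Big( \|WW^T - ZZ^T\|_F^2 + \|(W - ZR) W^T\|_F^2 \Big) + \frac{1}{4} \langle \nabla F(W), W - ZR \rangle + \frac{1}{8\|M\|} \|\tilde{Z}\tilde{Z}^T W\|_F^2. \tag{5.18}
$$

By assumption $\operatorname{dist}(W, Z) \le \frac{1}{4}\sigma_r(Z)$, so we can apply Lemma 5.7 to $\langle \nabla F(W), W - ZR \rangle$, which combined with (5.18) yields

$$
\begin{aligned}
\langle \nabla g(W), W - ZR \rangle
&\ge \Big( \frac{1}{16} - \frac{\delta_{4r}}{4} \Big) \|WW^T - ZZ^T\|_F^2 + \Big( \frac{1}{80} - \frac{\delta_{4r}}{4} \Big) \|(W - ZR) W^T\|_F^2 + \frac{\sigma_r(M)}{8} \|W - ZR\|_F^2 + \frac{1}{8\|M\|} \|\tilde{Z}\tilde{Z}^T W\|_F^2 \\
&\ge \frac{21}{400} \|WW^T - ZZ^T\|_F^2 + \frac{\sigma_r(M)}{8} \|W - ZR\|_F^2 + \frac{1}{8\|M\|} \|\tilde{Z}\tilde{Z}^T W\|_F^2.
\end{aligned}
\tag{5.19}
$$

Applying Lemma 5.11 together with $\delta_{4r} \le 1/25$ to (5.19) completes the proof of (5.15) and hence the theorem. All that remains is to prove Lemmas 5.10 and 5.11, which we do in Sections 5.4.1 and 5.4.2, respectively.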


5.4.1 Relating the regularity condition of g and F (Lemma 5.10) 

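We begin the proof of Lemma 5.10 with the following RIP inequality about the map $\mathcal{B}$. The proof of this lemma is almost identical to the proof of Lemma 5.1, so we omit the details.

Lemma 5.12. Suppose $\mathcal{A}$ is 2r-RIP with constant $\delta_{2r}$, and $\mathcal{B}$ is constructed from $\mathcal{A}$ as described above. For any rank-$r$ matrices $X, Y \in \mathbb{R}^{(n_1 + n_2) \times (n_1 + n_2)}$, we have

$$
|\langle \mathcal{B}(X), \mathcal{B}(Y) \rangle - \langle \mathcal{P}_{\mathrm{off}}(X), \mathcal{P}_{\mathrm{off}}(Y) \rangle| \le \delta_{2r} \|\mathcal{P}_{\mathrm{off}}(X)\|_F \|\mathcal{P}_{\mathrm{off}}(Y)\|_F.
$$

To relate the gradients $\nabla g(W)$ and $\nabla F(W)$ we first make a few manipulations to $\nabla g(W)$. Define $\Delta := WW^T - \operatorname{Sym}(M)$. We have

$$
\begin{aligned}
\nabla g(W) &= \frac{1}{2} \mathcal{B}^*\mathcal{B}(\Delta) W + \frac{1}{4} (\mathcal{P}_{\mathrm{diag}} - \mathcal{P}_{\mathrm{off}})(WW^T) W \\
&= \frac{1}{2} (\mathcal{B}^*\mathcal{B}(\Delta) - \mathcal{P}_{\mathrm{off}}(\Delta)) W + \frac{1}{2} \mathcal{P}_{\mathrm{off}}(\Delta) W + \frac{1}{4} (\mathcal{P}_{\mathrm{diag}} - \mathcal{P}_{\mathrm{off}})(WW^T) W \\
&= \frac{1}{2} (\mathcal{B}^*\mathcal{B}(\Delta) - \mathcal{P}_{\mathrm{off}}(\Delta)) W + \frac{1}{2} \mathcal{P}_{\mathrm{off}}(WW^T) W - \frac{1}{2} \mathcal{P}_{\mathrm{off}}(\operatorname{Sym}(M)) W \\
&\qquad + \frac{1}{4} \mathcal{P}_{\mathrm{diag}}(WW^T) W - \frac{1}{4} \mathcal{P}_{\mathrm{off}}(WW^T) W \\
&= \frac{1}{2} (\mathcal{B}^*\mathcal{B}(\Delta) - \mathcal{P}_{\mathrm{off}}(\Delta)) W + \frac{1}{4} (\mathcal{P}_{\mathrm{diag}} + \mathcal{P}_{\mathrm{off}})(WW^T) W - \frac{1}{2} \operatorname{Sym}(M) W \\
&= \frac{1}{2} (\mathcal{B}^*\mathcal{B}(\Delta) - \mathcal{P}_{\mathrm{off}}(\Delta)) W + \frac{1}{4} (WW^T - 2\operatorname{Sym}(M)) W.
\end{aligned}
\tag{5.20}
$$

Taking inner products of both sides of Equation (5.20) with $W - ZR$ gives us

$$
\langle \nabla g(W), W - ZR \rangle = \frac{1}{2} \langle (\mathcal{B}^*\mathcal{B}(\Delta) - \mathcal{P}_{\mathrm{off}}(\Delta)) W, W - ZR \rangle + \frac{1}{4} \langle (WW^T - 2\operatorname{Sym}(M)) W, W - ZR \rangle. \tag{5.21}
$$

The first term is simple to control with RIP. Observe that

$$
\begin{aligned}
\langle (\mathcal{B}^*\mathcal{B}(\Delta) - \mathcal{P}_{\mathrm{off}}(\Delta)) W, W - ZR \rangle
&= \langle \mathcal{B}^*\mathcal{B}(\Delta) - \mathcal{P}_{\mathrm{off}}(\Delta), (W - ZR) W^T \rangle \\
&= \langle \mathcal{B}(\Delta), \mathcal{B}((W - ZR) W^T) \rangle - \langle \mathcal{P}_{\mathrm{off}}(\Delta), \mathcal{P}_{\mathrm{off}}((W - ZR) W^T) \rangle \\
&\overset{(a)}{\ge} -\delta_{4r} \|WW^T - ZZ^T\|_F \|(W - ZR) W^T\|_F,
\end{aligned}
\tag{5.22}
$$

where (a) follows from Lemma 5.12.

We now relate the second term to the gradient of $F$. By exploiting the structure of $Z$ and $\tilde{Z}$, we have

$$
\begin{aligned}
\langle (WW^T - 2\operatorname{Sym}(M)) W, W - ZR \rangle
&\overset{(a)}{=} \langle (WW^T - ZZ^T) W, W - ZR \rangle + \langle \tilde{Z}\tilde{Z}^T W, W - ZR \rangle \\
&\overset{(b)}{=} \langle (WW^T - ZZ^T) W, W - ZR \rangle + \operatorname{Tr}(W^T \tilde{Z}\tilde{Z}^T W) \\
&\overset{(c)}{\ge} \langle (WW^T - ZZ^T) W, W - ZR \rangle + \frac{1}{\|\tilde{Z}\|^2} \|\tilde{Z}\tilde{Z}^T W\|_F^2 \\
&\overset{(d)}{=} \langle \nabla F(W), W - ZR \rangle + \frac{1}{2\|M\|} \|\tilde{Z}\tilde{Z}^T W\|_F^2,
\end{aligned}
\tag{5.23}
$$

where (a) holds because $2\operatorname{Sym}(M) = ZZ^T - \tilde{Z}\tilde{Z}^T$, (b) holds because $\tilde{Z}^T Z = 0_{r \times r}$, (c) holds because $\|\tilde{Z}\tilde{Z}^T W\|_F^2 = \operatorname{Tr}(W^T \tilde{Z}\tilde{Z}^T \tilde{Z}\tilde{Z}^T W) \le \sigma_1(\tilde{Z}^T \tilde{Z}) \operatorname{Tr}(W^T \tilde{Z}\tilde{Z}^T W)$, and (d) holds since $\|\tilde{Z}\|^2 = 2\|M\|$. The proof of Lemma 5.10 now follows from combining (5.21) with (5.22) and (5.23).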


5.4.2 Lipschitz-gradient type condition for g (Lemma 5.11) 

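The left-hand side of (5.17) has two terms. We start by bounding the second term. Fix any $\varepsilon > 0$. Then,

$$
\begin{aligned}
\frac{1}{8\|M\|} \|\tilde{Z}\tilde{Z}^T W\|_F^2
&= \frac{1}{8\|M\|} \|(\mathcal{P}_{\mathrm{diag}} - \mathcal{P}_{\mathrm{off}})(WW^T) W + (\mathcal{P}_{\mathrm{diag}} - \mathcal{P}_{\mathrm{off}})(ZZ^T - WW^T) W\|_F^2 \\
&\overset{(a)}{\ge} \frac{1}{8\|M\|} \frac{\varepsilon}{1 + \varepsilon} \|(\mathcal{P}_{\mathrm{diag}} - \mathcal{P}_{\mathrm{off}})(WW^T) W\|_F^2 - \frac{\varepsilon}{8\|M\|} \|(\mathcal{P}_{\mathrm{diag}} - \mathcal{P}_{\mathrm{off}})(ZZ^T - WW^T) W\|_F^2 \\
&\ge \frac{1}{8\|M\|} \frac{\varepsilon}{1 + \varepsilon} \|(\mathcal{P}_{\mathrm{diag}} - \mathcal{P}_{\mathrm{off}})(WW^T) W\|_F^2 - \frac{\varepsilon \|W\|^2}{8\|M\|} \|WW^T - ZZ^T\|_F^2 \\
&\overset{(b)}{\ge} \frac{1}{8\|M\|} \frac{\varepsilon}{1 + \varepsilon} \|(\mathcal{P}_{\mathrm{diag}} - \mathcal{P}_{\mathrm{off}})(WW^T) W\|_F^2 - \frac{25\varepsilon}{64} \|WW^T - ZZ^T\|_F^2.
\end{aligned}
\tag{5.24}
$$

Here, (a) holds since by Young's inequality for any $\varepsilon > 0$, we have $(a - b)^2 \ge \frac{\varepsilon}{1 + \varepsilon} a^2 - \varepsilon b^2$, and (b) holds since $2\|M\| = \|Z\|^2$ and $\|W\| \le \frac{5}{4} \|Z\|$.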

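To bound the first term in the left-hand side of (5.17), we state a lemma which shows that our augmented measurement map $\mathcal{B}$ obeys a similar Lipschitz property to that of $\mathcal{A}$ stated in Lemma 5.9. The proof of this lemma is nearly identical to that of Lemma 5.9, and requires minor modifications to deal with the projection operator $\mathcal{P}_{\mathrm{off}}$. We omit the details.

Lemma 5.13. Let $\mathcal{A}$ be as in the hypothesis of Lemma 5.9. Then for all $W, Z \in \mathbb{R}^{(n_1 + n_2) \times r}$, we have

$$
\|\mathcal{P}_{\mathrm{off}}(WW^T - ZZ^T)\|_F^2 \ge \frac{10}{17} \frac{1}{\|W\|^2} \|\mathcal{B}^*\mathcal{B}(WW^T - ZZ^T) W\|_F^2.
$$

With this lemma in place, note that for any $\gamma > 0$,

$$
\begin{aligned}
\frac{1}{20} \|WW^T - ZZ^T\|_F^2
&\overset{(a)}{\ge} \frac{1}{34 \|W\|^2} \|\mathcal{B}^*\mathcal{B}(WW^T - ZZ^T) W\|_F^2 \\
&\overset{(b)}{\ge} \frac{4}{425 \|M\|} \|\mathcal{B}^*\mathcal{B}(WW^T - ZZ^T) W\|_F^2 \\
&= \frac{4}{425 \|M\|} \|\mathcal{B}^*\mathcal{B}(\Delta) W\|_F^2 \\
&= \frac{16}{425 \|M\|} \Big\| \nabla g(W) - \frac{1}{4} (\mathcal{P}_{\mathrm{diag}} - \mathcal{P}_{\mathrm{off}})(WW^T) W \Big\|_F^2 \\
&\overset{(c)}{\ge} \frac{16}{425 \|M\|} \frac{\gamma}{1 + \gamma} \|\nabla g(W)\|_F^2 - \frac{\gamma}{425 \|M\|} \|(\mathcal{P}_{\mathrm{diag}} - \mathcal{P}_{\mathrm{off}})(WW^T) W\|_F^2.
\end{aligned}
\tag{5.25}
$$

Here, (a) follows from Lemma 5.13, (b) follows because $\|W\| \le \frac{5}{4} \|Z\|$, and (c) is another application of Young's inequality. Combining (5.24) and (5.25) with the hypothesis that $\delta_{4r} \le 1/25$, and setting $\varepsilon = 4/625$, $\gamma = 25/74$ completes the proof.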
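The choice $\varepsilon = 4/625$, $\gamma = 25/74$ is exactly what makes the penalty terms in (5.24) and (5.25) cancel and leaves the constant $16/1683$ of (5.17). The sketch below checks this bookkeeping in exact arithmetic; it assumes the coefficients $25\varepsilon/64$, $\frac{1}{8}\frac{\varepsilon}{1+\varepsilon}$, $\frac{16}{425}\frac{\gamma}{1+\gamma}$ and $\frac{\gamma}{425}$ (each per unit of $1/\|M\|$ where applicable), and the variable names are illustrative.

```python
from fractions import Fraction as Fr

eps, gam = Fr(4, 625), Fr(25, 74)

# (5.24): share of the 21/400 budget on ||WW^T - ZZ^T||_F^2 that is consumed.
used = Fr(25, 64) * eps
remaining = Fr(21, 400) - used          # what is left for (5.25)

# Coefficients (per 1/||M||) of ||(P_diag - P_off)(WW^T) W||_F^2.
gain = Fr(1, 8) * eps / (1 + eps)       # gained in (5.24)
loss = gam / 425                        # lost in (5.25)

# Coefficient (per 1/||M||) of ||grad g(W)||_F^2 in (5.25).
grad_coeff = Fr(16, 425) * gam / (1 + gam)
```

Since `gain` equals `loss`, the projection terms cancel, `remaining` matches the $1/20$ on the left of (5.25), and `grad_coeff` reproduces the constant in (5.17).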


5.4.3 Proofs for the initialization phase of Algorithm 2 (Theorem 3.3, Equation (3.6)) 

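We start with the following generalization of Lemma 5.4, the proof of which is deferred to Appendix C.

Lemma 5.14. Let $M_1, M_2 \in \mathbb{R}^{n_1 \times n_2}$ be two rank-$r$ matrices with SVDs of the form $M_1 = U_1 \Sigma_1 V_1^T$ and $M_2 = U_2 \Sigma_2 V_2^T$. For $\ell = 1, 2$, define $X_\ell = U_\ell \Sigma_\ell^{1/2} \in \mathbb{R}^{n_1 \times r}$ and $Y_\ell = V_\ell \Sigma_\ell^{1/2} \in \mathbb{R}^{n_2 \times r}$. Furthermore, assume $M_1$ and $M_2$ obey $\|M_2 - M_1\| \le \frac{1}{2} \sigma_r(M_1)$. Under these assumptions the following inequality holds

$$
\operatorname{dist}^2\left( \begin{bmatrix} X_2 \\ Y_2 \end{bmatrix}, \begin{bmatrix} X_1 \\ Y_1 \end{bmatrix} \right) \le \frac{2}{\sqrt{2} - 1} \frac{\|M_2 - M_1\|_F^2}{\sigma_r(M_1)}.
$$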

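The rest of the proof proceeds similarly to the proof of Equation (3.3). Using Lemma 5.1, we conclude that $\rho(\mathcal{A})$ from Lemma 5.2 is bounded by $\rho(\mathcal{A}) \le 2\delta_{4r} \le 2/25$. Setting $M_0 = 0_{n_1 \times n_2}$ and applying Lemma 5.2 to our initialization iterates, we have that

$$
\|M_\tau - M\|_F \le (2/25)^\tau \|M\|_F. \tag{5.26}
$$

In order for the RHS to be bounded above by $\frac{1}{2}\sigma_r(M)$, $\tau$ must satisfy

$$
\tau \ge \frac{1}{\log(25/2)} \log\left( \frac{2\|M\|_F}{\sigma_r(M)} \right). \tag{5.27}
$$

When this happens, Lemma 5.14 tells us that

$$
\operatorname{dist}^2(W_0, Z) \le \frac{2}{\sqrt{2} - 1} \frac{\|M_\tau - M\|_F^2}{\sigma_r(M)}.
$$

In order for this RHS to be bounded above by $\frac{1}{16}\sigma_r^2(Z) = \frac{1}{8}\sigma_r(M)$, we require that

$$
\frac{2}{\sqrt{2} - 1} \frac{\|M_\tau - M\|_F^2}{\sigma_r(M)} \le \frac{\sigma_r(M)}{8}.
$$

Using Equation (5.26), it is sufficient for $\tau$ to satisfy

$$
\tau \ge \frac{1}{\log(25/2)} \log\left( \frac{4}{\sqrt{\sqrt{2} - 1}} \frac{\|M\|_F}{\sigma_r(M)} \right). \tag{5.28}
$$

Since $\|M\|_F \le \sqrt{r}\, \|M\|$, setting $T_0$ as $T_0 \ge 3\log(\sqrt{r}\, \kappa) + 5$ satisfies both (5.27) and (5.28).

A Proof of Lemma 5.4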

Define $H = U - XR$. Similar to the discussion at the beginning of Section 5.2.1, without loss of generality we can assume that (a) $R = I$, (b) $U^T X \succeq 0$, and (c) $H^T X = X^T H$. With these simplifications, establishing the lemma is equivalent to showing that

$$
\operatorname{Tr}\big( (H^T H)^2 + 4 H^T H H^T X + 2 (H^T X)^2 + 2 X^T X H^T H - \eta H^T H \big) \ge 0 \tag{A.1}
$$

holds with $\eta = 2(\sqrt{2} - 1)\sigma_r^2(X)$.

We note that

$$
\begin{aligned}
&\operatorname{Tr}\big( (H^T H + \sqrt{2} H^T X)^2 + (4 - 2\sqrt{2}) H^T H H^T X + 2 X^T X H^T H - \eta H^T H \big) \\
&\qquad = \operatorname{Tr}\big( (H^T H)^2 + 4 H^T H H^T X + 2 (H^T X)^2 + 2 X^T X H^T H - \eta H^T H \big).
\end{aligned}
$$

Hence, a sufficient condition for (A.1) to hold is

$$
(4 - 2\sqrt{2}) H^T X + 2 X^T X - \eta I_r \succeq 0. \tag{A.2}
$$

Recalling that $H = U - X$, and that $U^T X \succeq 0$, we have

$$
\begin{aligned}
(4 - 2\sqrt{2}) H^T X + 2 X^T X - \eta I_r &= (4 - 2\sqrt{2}) U^T X + (2 - (4 - 2\sqrt{2})) X^T X - \eta I_r \\
&= (4 - 2\sqrt{2}) U^T X + 2(\sqrt{2} - 1) X^T X - \eta I_r.
\end{aligned}
$$

Since $U^T X \succeq 0$, to show (A.2) it suffices to show

$$
2(\sqrt{2} - 1) X^T X - \eta I_r \succeq 0 \iff X^T X \succeq \frac{\eta}{2(\sqrt{2} - 1)} I_r.
$$

The RHS trivially holds, concluding the proof.
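In the scalar case $n = r = 1$ the statement being proved reduces to an elementary inequality, which makes for a quick numerical sanity check. The sketch below assumes the lemma reads $\|UU^T - XX^T\|_F^2 \ge \eta \operatorname{dist}^2(U, X)$ with $\eta = 2(\sqrt{2} - 1)\sigma_r^2(X)$, and uses the fact that the only $1 \times 1$ orthogonal matrices are $\pm 1$; the function names are hypothetical.

```python
import math

ETA_FACTOR = 2 * (math.sqrt(2) - 1)

def dist_sq(u, x):
    """Squared Procrustes distance for scalars: the rotation is +1 or -1."""
    return min((u - x) ** 2, (u + x) ** 2)

def lemma_gap(u, x):
    """(u^2 - x^2)^2 - eta * dist^2(u, x) with eta = 2 (sqrt(2) - 1) x^2."""
    return (u * u - x * x) ** 2 - ETA_FACTOR * x * x * dist_sq(u, x)

# Scan a grid of scalar pairs (u, x) in [-5, 5] x [-5, 5].
worst = min(lemma_gap(u / 10, x / 10)
            for u in range(-50, 51) for x in range(-50, 51))
```

The smallest gap on the grid is attained at $u = \pm x$, where both sides vanish; everywhere else the left-hand side dominates, as the proof predicts.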


B Proof of Lemma 3.4 

From RIP and the assumption that $\delta_{2r} \le 1/10$, we have


- XX^ 

By Weyl’s inequalities, this means that 


< 


- XX^ 


<r /lO 
F V 9 


a‘^{X) > ar{Mr) - \l —Cj 


Lemma 5.4 ensures that 


dist(C/o,^) < 


3 1 


Mr - XX^ 


2ar{X) 

We can upper bound the RHS by the following chain of inequalities. 


3 1 /3 1 /To 

2^A^ ^ F - \l 2^^(^\l 

'"\/l 7 x) 2/1 

\/l 7 v) 


where (a) follows from RIP, (b) follows since Cr < ■^ar{M) implies that 


'10 1 
— Cr < -^ 

9 “ 276 


ar{M) - \l —er I , 


(B.l) 


and (c) follows by (B.1).

C Proof of Lemma 5.14

To begin with note that by the dilation trick we have for $\ell = 1, 2$,

By simple algebraic manipulations we have

Furthermore,

Applying Weyl's inequality to (C.2), we have

M^2 -
(C.3)


Applying Lemma 5.4 to the matrices
we conclude that
and utilizing equations (C.1) and (C.3)
$\sigma_r(M_1)$

Let $A \Sigma B^T$ be the singular value decomposition of $X_1^T X_2 + Y_1^T Y_2$. It is easy to verify that the solution $R$ to the orthogonal Procrustes problem (equivalently the optimal rotation) between

$$
\begin{bmatrix} X_1 \\ Y_1 \end{bmatrix} \qquad \text{and} \qquad \begin{bmatrix} X_2 \\ Y_2 \end{bmatrix}
$$

is equal to $R = AB^T$. From this we conclude that
Plugging the latter in (C.4) concludes the proof.

D Proof of the first inequality in Equation (3.8)


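The first inequality is immediate from the following lemma.

Lemma D.1. Let $U, X \in \mathbb{R}^{n_1 \times r}$ and $V, Y \in \mathbb{R}^{n_2 \times r}$. Suppose that

$$
\operatorname{dist}\left( \begin{bmatrix} U \\ V \end{bmatrix}, \begin{bmatrix} X \\ Y \end{bmatrix} \right) \le \frac{1}{4} \left\| \begin{bmatrix} X \\ Y \end{bmatrix} \right\|.
$$

Then, we have that

$$
\|UV^T - XY^T\|_F \le \frac{9}{4\sqrt{2}} \left\| \begin{bmatrix} X \\ Y \end{bmatrix} \right\| \operatorname{dist}\left( \begin{bmatrix} U \\ V \end{bmatrix}, \begin{bmatrix} X \\ Y \end{bmatrix} \right).
$$

Proof. Put $W := \begin{bmatrix} U \\ V \end{bmatrix}$ and $Z := \begin{bmatrix} X \\ Y \end{bmatrix}$. We have that $\|UV^T - XY^T\|_F \le \frac{1}{\sqrt{2}} \|WW^T - ZZ^T\|_F$. Applying Lemma 5.3 to $\|WW^T - ZZ^T\|_F$ we conclude that $\frac{1}{\sqrt{2}} \|WW^T - ZZ^T\|_F \le \frac{9}{4\sqrt{2}} \|Z\| \operatorname{dist}(W, Z)$. The result now follows.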